Abstract. The combination of several socio-economic databases originating from
different administrative sources, collected on several different partitions of a geographic
zone of interest into administrative units, induces the so-called areal interpolation problem.
This problem is that of allocating the data from a set of source spatial units to a set of 
target spatial units. A particular case of that problem is the re-allocation to a single target 
partition which is a regular grid. At the European level for example, the EU directive
'INSPIRE', or INfrastructure for SPatial InfoRmation, encourages the states to provide
socio-economic data on a common grid to facilitate economic studies across states. In the
literature, there are three main types of areal interpolation techniques: proportional weighting
schemes, smoothing techniques and regression based interpolation. We propose a stochastic model
based on Poisson point patterns to study the statistical accuracy of these techniques for 
regular grid targets in the case of count data. The error depends on the nature of the 
target variable and its correlation with the auxiliary variable. For simplicity, we restrict 
attention to proportional weighting schemes and Poisson regression based methods. Our 
conclusion is that there is no technique which always dominates. 

Keywords. Areal interpolation, spatial disaggregation, pycnophylactic property,
spatial misalignment, accuracy.

1 Introduction 

The analysis of socio-economic data often involves the integration of various spatial 
data sources. Those data are often independently collected by a variety of offices and for 
different purposes. The zonal set systems used by distinct offices are rarely compatible 
and this leads to many difficulties. The problem of merging data bases on different 
spatial supports is called the areal interpolation or basis change problem (Goodchild 
and Lam 1980). In France, the need for official statistics at a more and more refined 
territorial level has been recognized by INSEE. In Europe, one of the objectives of the EU 
directive “INSPIRE”, for INfrastructure for SPatial InfoRmation, is to harmonize quality 
geographic information to support the formulation and evaluation of public policies and 
activities which directly or indirectly impact the environment. Many statistical methods
are proposed in the literature to handle this problem (dasymetric methods, regression
methods, smoothing techniques) and the reader is referred to Do et al. (2014) for a recent
review of the simplest ones. Their relative accuracy is most often assessed at
an empirical level (see for example Reibel and Bufalino (2005), Mennis (2006), Flowerdew
and Green (1992), Flowerdew et al. (1991), Reibel and Agrawal (2007), Gregory (2002)).
At the theoretical level, only a few articles address this problem (Sadahiro (1999, 2000)),
and doing so is the objective of this work.

Comparing the accuracy of the different methods is difficult because the relative
accuracy depends on several factors: nature of the target variable, correlation between the
target and auxiliary variables, shapes of zonal sets, relative size between the two zonal
sets, etc. In order to derive theoretical results, we need to consider simplifying restrictions.
For this reason, in this document, we first restrict attention to data obtained from
counts (see section 2): they are frequent in the literature and cover most of the cases in
socio-economic applications. We also restrict the comparison to the simplest classes
of methods, which are the dasymetric and the regression ones. Lastly, we make the
assumption that target zones are nested within source zones. This is not really a
restriction since the intersections between sources and targets are always nested within
sources and it is immediate to go from intersection level to target level by aggregating the
predictions, as we will see later. In section 2, we define what we mean by data obtained
from counts and we introduce a mathematical model adapted to this case. In order to
illustrate the methods and check our theoretical results, we present two sets of simulated
data that we use later. In section 3, we recall the formulas for the dasymetric and Poisson
regression areal interpolation. Finally, in section 4, we compare the relative accuracy
of areal weighting and dasymetric methods with finite distance results, whereas in
section 5 we compare the relative accuracy of dasymetric and Poisson regression methods
with asymptotic methods. In both sections, we comment on the results obtained on the toy
examples presented in section 2. All proofs are in the appendix.


2 Count data and Poisson point pattern model 

The variable of interest $Y$ that needs to be interpolated is called the target variable
and it needs to have a meaning on any subregion of the given space. $Y_D$ will denote the
value of the target variable on the subregion $D$ of the region of interest $\Omega$.

In the general area-to-area reallocation problem, the original data for the target variable
is available for a set of source zones $S_s$ ($s = 1, \dots, n_S$) and has to be transferred to an
independent set of target zones $T_t$ ($t = 1, \dots, n_T$). The variable $Y_{S_s}$ will be denoted by
$Y_s$ for simplicity, and similarly $Y_{T_t}$ by $Y_t$. The source zones and target zones are not
necessarily nested and their boundaries do not usually coincide.

Overlaps between the two sets are called intersection zones and denoted by $A_{st}$ for the
intersection between the source $S_s$ and the target $T_t$. For simplicity, $Y_{A_{st}}$ will be denoted
by $Y_{st}$. Many methods involve the areas of different subregions (sources, targets or other).
We will denote by $|A|$ the area of any subregion $A$.

Most economic data collected at regional level result from aggregating point data and
are only released in this aggregated form. Intuitively, let us say that a point data set is 
a set of a random number of random points in a given region of geographical space. The 
collection of corresponding numbers of such points in given subdivisions of this region 
is a count data set. For example with census data, a population count on a given zone 
is the number of inhabitants of the zone. This number is obtained from the knowledge 
of the addresses of these people. The collection of coordinates of such addresses is the 
underlying point data set. Examples of areal interpolation of population or subpopulation
counts can be found in Goodchild and Lam (1980), Langford (2005), Mennis
and Hultgren (2006), Reibel and Agrawal (2007). Other types of counts are encountered
frequently, for example the number of housing units in Reibel and Bufalino (2005). Another
frequent type of count related variable is the number of points per areal unit associated
to a point data set: it is a density type variable. Examples of areal interpolation of
population densities can be found in Yuan et al. (1997) and Murakami (2011). An even
more general type is when the variable is a ratio of counts such as number of doctors
per patient. There is an easy one to one correspondence between a count variable and a
density variable, which allows one to transform one type into the other so that any treatment
of counts can be extended to densities and conversely. A count variable belongs to the
family of extensive variables, which are variables whose value on a region is obtained by 
summing up its values on any partition into subregions (aggregation formula hereafter). 
A density variable belongs to the family of intensive variables, which are variables whose 
value on a region is obtained from values on any partition into subregions by a weighted 
sum (see Do et al. 2014 for more details). In the case of population density, the weights
are given by the areas of the subregions of the partition. In the remainder of this paper, 
we will concentrate on pure count variables. 

We introduce a model for an extensive count variable by assuming that there exists
an underlying (unreleased) Poisson point pattern $Z_Y$ (in the population example, the
positions of the individuals of the population) and that the target variable $Y$ on a subzone
$A$ is the number of points of $Z_Y$ in $A$. For a partition $D_i$, $i = 1, 2, \dots, k$ of the region $D$,
the extensive property is clearly satisfied:

$$Y_D = \sum_{i=1}^{k} Y_{D_i}.$$

With the proposed Poisson point pattern assumption, for any zone $A$, $Y_A = \sum_i 1_A(Z_i)$,
where the $Z_i$ are the points of $Z_Y$, is a Poisson distributed random variable with mean
$E(Y_A) = \int_A \lambda_{Z_Y}(s)\,ds$, where $\lambda_{Z_Y}$ is the
intensity of the point process $Z_Y$.
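
To make the count model concrete, the following minimal Python sketch simulates a Poisson point pattern and recovers cell counts. It is an illustration only, not part of the original study: the unit square, the constant intensity of 200, the 5 x 5 grid and the helper name `cell_counts` are all assumptions chosen for the example.

```python
import numpy as np

def cell_counts(points, n_cells):
    """Count the points of a pattern on [0, 1)^2 in each cell of a regular grid."""
    ix = np.floor(points[:, 0] * n_cells).astype(int)
    iy = np.floor(points[:, 1] * n_cells).astype(int)
    counts = np.zeros((n_cells, n_cells), dtype=int)
    np.add.at(counts, (ix, iy), 1)      # increment the cell containing each point
    return counts

rng = np.random.default_rng(0)
intensity = 200.0                       # assumed constant intensity (points per unit area)
n_points = rng.poisson(intensity)       # the total number of points is Poisson
points = rng.random((n_points, 2))      # given that number, locations are uniform
counts = cell_counts(points, 5)

# Extensive property: the count on a region is the sum of the counts on its parts.
total_from_cells = counts.sum()
```

Aggregating the cell counts over any union of cells gives the count of that union, which is the aggregation formula used throughout.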

This model implies that $Y_A$ and $Y_B$ are automatically independent for any pair of disjoint
subregions $A$ and $B$, by the independence property of the Poisson process. We could use point
pattern models with interaction effects while retaining the extensive property, but we devote
this article to this first case, keeping the interaction case for further developments.

As we will see in the next section, some methods we want to compare (dasymetric and
univariate regression) make use of auxiliary information. For the auxiliary variable
$X$ to be relevant, there must be some relationship between the target variable and the
auxiliary variable. In many cases categorical information is used, such as land cover:
Reibel and Agrawal (2007) and Yuan et al. (1997) use land cover type data on a 30
meter resolution grid, Mennis and Hultgren (2006) use 5 types of land cover obtained
manually from aerial photography. Li et al. (2007) just use binary information such
as unpopulated versus populated zones. Reibel and Bufalino (2005) interpolate the 1990
census tract counts of people and housing using length of streets as auxiliary information.
Mugglin and Carlin (1998) exploit population to interpolate the number of leukemia
cases. The use of continuous auxiliary information can also be found: Murakami (2011)
utilizes distance and land price to predict population density. In the rest of the paper,
we concentrate on a single extensive auxiliary variable $X$ that is also a count in order
to be able to consider the accuracy of all methods simultaneously (more details at the
end of section 3). Therefore it corresponds to another underlying point process $Z_X$ with
intensity $\lambda_{Z_X}$.

The auxiliary variable $X$ has to be known at intersection level in the case of dasymetric
and at the target level in the case of regression. We need to write a formal relationship
between our target variable and the auxiliary information. The model we propose assumes
that the following relationship holds between the intensity functions of the two underlying
point processes:

$$\lambda_{Z_Y}(s) = \alpha + \beta \lambda_{Z_X}(s), \tag{1}$$

where $s$ is location. Therefore, the following relationship holds between $Y$ and $X$: at
the level of any subset $A$ of the region, the conditional distribution of $Y_A$ given $X_A = x_A$
is given by

$$Y_A \sim \mathcal{P}(\alpha |A| + \beta x_A). \tag{2}$$

This relationship will be used at target level $A = T$ and at source level $A = S$. This
model in its general form will be called auxiliary information model (AIM). In this model,
the intensity of $Z_Y$ is driven by two effects: the effect of the auxiliary variable $X$ and the
effect of the area of the zone. If we look at target level, the target variable is Poisson
distributed with a mean comprising two parts, $E(Y_T) = \alpha |T| + \beta x_T$: the first part $\alpha |T|$
reflects the impact of the area of the zone $T$, whereas $\beta x_T$ is the impact of the auxiliary
variable. The linearity of the expected value of $Y$ with respect to the area and to the
auxiliary information is not canonical in a Poisson regression model for counts but in our
case it derives naturally from (1).

In sections 4.2 and 4.3, we introduce two sub-models of model (2) depending on the
intensity function $\lambda_{Z_Y}$. We consider the case of a constant intensity (homogeneous model)
and the case of a piecewise constant intensity (piecewise homogeneous model).


3 Methods


3.1 Prediction techniques 

Do et al. (2014) classify areal interpolation methods into three groups: smoothing,
dasymetric and regression based methods. We discard smoothing since it is concerned with
continuous target variables, which does not fit our count data model. We therefore
focus here on the remaining two groups: dasymetric and regression based methods. 

Dasymetric is a class of methods using a weighting scheme to allocate the original 
data to the intersections and then applying an aggregation step to get to target level. 
The simplest method in the dasymetric class is the areal weighting interpolation which 
uses area as weighting scheme: the data is allocated to the targets based on the assumption 
that the target variable is homogeneous at source level:

$$\hat{Y}_t^{DAW} = \sum_{s : s \cap t \neq \emptyset} \hat{Y}_{st} = \sum_{s : s \cap t \neq \emptyset} \frac{|A_{st}|}{|S_s|} Y_s.$$

Note that areal weighting interpolation does not use any auxiliary information other than area,
which is usually available.
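
The allocation and aggregation steps of areal weighting can be sketched as follows. This is an illustrative sketch rather than code from the study: the function name `areal_weighting`, the matrix layout (`inter_area[s, t]` holds $|A_{st}|$) and the toy numbers are assumptions, and each source is assumed to be fully covered by the targets so that $|S_s| = \sum_t |A_{st}|$.

```python
import numpy as np

def areal_weighting(y_source, inter_area):
    """Areal weighting with inter_area[s, t] = |A_st| and y_source[s] = Y_s."""
    source_area = inter_area.sum(axis=1, keepdims=True)      # |S_s| = sum_t |A_st|
    y_inter = inter_area / source_area * y_source[:, None]   # allocation to intersections
    return y_inter.sum(axis=0)                               # aggregation to targets

# Hypothetical layout: 2 sources, 3 targets, the middle target straddles both sources.
inter_area = np.array([[2.0, 1.0, 0.0],
                       [0.0, 1.0, 3.0]])
y_source = np.array([30.0, 40.0])
y_target = areal_weighting(y_source, inter_area)
```

Because every source value is fully redistributed, the target predictions sum to the total of the source values.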

The general dasymetric method is supposed to improve upon the areal weighting interpolation
method when an additional variable is known to be linked to the target variable,
leading to alternative weighting schemes. Voss et al. (1999) use road segment length and
the number of road nodes for allocating demographic characteristics. Population, which
is collected at fine levels in general, is often used as auxiliary information for other
variables, as in Gregory (2002) or Mugglin and Carlin (1998). Instead of homogeneity,
the dasymetric method with auxiliary information $X$ assumes that the target variable is
proportional to the auxiliary variable at intersection level:

$$\hat{Y}_t^{DAX} = \sum_{s : s \cap t \neq \emptyset} \hat{Y}_{st} = \sum_{s : s \cap t \neq \emptyset} \frac{x_{st}}{x_s} Y_s,$$

where $x_s = \sum_t x_{st}$. This entails that $X$ has to be known at intersection level, which is
quite restrictive.
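
The dasymetric allocation can be sketched in the same way, replacing areas by auxiliary counts. The function name `dasymetric`, the layout `x_inter[s, t]` for $x_{st}$ and the numbers are hypothetical; the sketch assumes every source has a positive auxiliary total $x_s$.

```python
import numpy as np

def dasymetric(y_source, x_inter):
    """Dasymetric weighting with x_inter[s, t] = x_st and y_source[s] = Y_s."""
    x_source = x_inter.sum(axis=1, keepdims=True)        # x_s = sum_t x_st (must be > 0)
    y_inter = x_inter / x_source * y_source[:, None]     # allocation proportional to x_st
    return y_inter, y_inter.sum(axis=0)                  # intersections, then targets

# Hypothetical auxiliary counts on the intersections of 2 sources and 3 targets.
x_inter = np.array([[10.0, 30.0, 0.0],
                    [0.0, 5.0, 15.0]])
y_source = np.array([80.0, 40.0])
y_inter, y_target = dasymetric(y_source, x_inter)
```

Summing the intersection predictions over each source returns the observed source values, so this allocation preserves the source data.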

Concerning the regression based methods, there are several types
also involving auxiliary information (see Do et al., 2014). Given the nature of
the target variable in our model (2), we concentrate on the Poisson regression presented
in Flowerdew et al. (1991) for the purpose of predicting population (which is an extensive
variable) with categorical auxiliary information. Based on model (2), a Poisson regression
with identity link is performed at source level, yielding estimators $\hat{\alpha}$, $\hat{\beta}$ for the parameters
$\alpha$ and $\beta$.

The prediction of the target variable at intersection level is then obtained by

$$\hat{Y}_{st}^{REG} = \hat{\alpha} |A_{st}| + \hat{\beta} x_{st} \tag{3}$$

and the final step aggregates intersection predictions at target level. The regression
based methods can be considered as more powerful than the dasymetric methods in the
sense that they can incorporate multivariate auxiliary information and that the knowledge
of auxiliary information is only needed at target level and not at intersection level.
However, since the purpose of this paper is to compare the accuracy of dasymetric methods
and Poisson regression methods from a methodological point of view and for the case
of extensive count data, we concentrate on the unidimensional auxiliary count
variable case.
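
A Poisson regression with identity link can be fitted by iteratively reweighted least squares. The sketch below is a hand-written illustration under that assumption, not the estimation code used in the study: the function name `poisson_identity_fit`, the starting values, the fixed number of iterations and the noise-free check data are all choices made for the example.

```python
import numpy as np

def poisson_identity_fit(area, x, y, n_iter=50):
    """Fit E(Y) = alpha * area + beta * x by iteratively reweighted least squares."""
    design = np.column_stack([area, x])
    mu = np.maximum(y.astype(float), 1.0)        # crude positive starting means
    coef = np.zeros(2)
    for _ in range(n_iter):
        w = 1.0 / mu                             # Poisson variance equals the mean
        lhs = design.T @ (design * w[:, None])
        rhs = design.T @ (w * y)
        coef = np.linalg.solve(lhs, rhs)         # weighted least squares step
        mu = np.maximum(design @ coef, 1e-8)     # keep the fitted means positive
    return coef

# Noise-free source values built with alpha = 3 and beta = 2, used only as a check.
area_s = np.array([1.0, 2.0, 3.0, 4.0])
x_s = np.array([10.0, 5.0, 40.0, 20.0])
y_s = 3.0 * area_s + 2.0 * x_s
coef = poisson_identity_fit(area_s, x_s, y_s)
alpha_hat, beta_hat = coef

# Prediction (3) on a hypothetical intersection with |A_st| = 0.5 and x_st = 4.
y_st_pred = alpha_hat * 0.5 + beta_hat * 4.0
```

With the identity link the working response is the observed count itself, so each iteration is a weighted least squares fit with weights equal to the inverse of the current fitted means.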

One property often quoted concerning these methods is the pycnophylactic property. 
This property requires the preservation of the initial data in the following sense at some 
geographical level: at source level for example, it means that the predicted value for 
source $S_s$ obtained by aggregating the predicted values on intersections with $S_s$ should
coincide with the observed value on $S_s$. The enforcement of this property will allow us to
introduce an improved version of the basic Poisson regression method. 

3.2 Prediction error criteria 

The accuracy assessment necessitates the choice of a prediction error criterion and of 
a geographic level. In this framework, examples of criteria are root mean square error or 
mean square error (Sadahiro 1999, Reibel 2006,...) at regional level (that is the union of 
all sources), or relative absolute error at target level (Langford, 2007). We denote by MET 
a generic method of prediction and let MET be DAW for the areal weighting method, 
DAX for the general dasymetric method, REG for the Poisson regression method and 
ScR for the scaled regression method which will be presented later in section 5. We recall
that we assume all target zones are nested within source zones. 

In section 4, we use mean square error at source level to compare the areal weighting and
dasymetric methods. For method MET, the source level error is then computed as follows

$$\mathrm{Er}_S^{MET} = \sum_{T \subset S} \mathrm{Er}_T^{MET} = \sum_{T \subset S} E(\hat{Y}_T^{MET} - Y_T)^2 \tag{4}$$

and the overall regional error is

$$\mathrm{Er}^{MET} = \sum_{S} \sum_{T \subset S} E(\hat{Y}_T^{MET} - Y_T)^2. \tag{5}$$

In section 5, we use mean square error at target level

$$\mathrm{Er}_T^{MET} = E(\hat{Y}_T^{MET} - Y_T)^2 \tag{6}$$

to compare the dasymetric and Poisson regression methods.

In general, we will also use the relative error criterion defined as

$$\mathrm{Re}_S^{MET} = \frac{\sqrt{\mathrm{Er}_S^{MET}}}{E(Y_S)}, \tag{7}$$

where $\mathrm{Re}_S^{MET}$ is the relative error of method MET at source level for source $S$.


4 Relative accuracy of areal weighting and dasymetric: finite distance assessment


Let us briefly summarize the findings of the assessments found in the literature for the
comparison of general dasymetric and areal weighting. For empirical assessments, several
authors report that the dasymetric method improves upon areal weighting. Depending on
the context, the improvement varies: Langford (2007) reports improvements of 54%, 57%
and 59% depending on the auxiliary information used; Reibel and Bufalino (2005)
report improvements of 71.26% and 20.08% with street length auxiliary information for
the two target variables, housing units and total population. For theoretical assessments,
Sadahiro (1999, 2000) compares the areal weighting interpolation and the point-in-polygon
method with a theoretical model. We have not mentioned the point-in-polygon method so far
because it is a very elementary one, which consists in allocating a source value to the target
which contains its centroid. Using a stochastic model, he finds that the factors that
impact the accuracy of the methods are the size and shape of target and source zones and
the properties of the underlying points.


In this section, we prove some theoretical properties in subsection 4.1, with two particular
cases in 4.2 and 4.3, and a toy example in 4.4.

Since targets are nested within sources, the predictors of the two methods depend only on
the source that contains the concerned target zone. For that reason, we focus on studying
one source zone denoted by $S$. For a target $T$ in $S$, the two predictors are as follows:

$$\hat{Y}_T^{DAW} = \frac{|T|}{|S|} Y_S \tag{8}$$

and

$$\hat{Y}_T^{DAX} = \frac{x_T}{x_S} Y_S. \tag{9}$$


4.1 General auxiliary information model 


Lemma 4.1 gives the expression of the prediction bias and variance in model AIM for
areal weighting interpolation and dasymetric interpolation at target level.

Lemma 4.1. In model AIM, the prediction biases and variances of areal weighting interpolation
and dasymetric methods are given by

$$E(\hat{Y}_T^{DAW} - Y_T) = \beta x_S \left( \frac{|T|}{|S|} - \frac{x_T}{x_S} \right) \tag{10}$$

$$E(\hat{Y}_T^{DAX} - Y_T) = \alpha |S| \left( \frac{x_T}{x_S} - \frac{|T|}{|S|} \right) \tag{11}$$

$$\mathrm{Var}(\hat{Y}_T^{DAW} - Y_T) = \beta x_S \left( \frac{|T|}{|S|} - \frac{x_T}{x_S} \right)^2 + \beta x_T \left( 1 - \frac{x_T}{x_S} \right) + \alpha |T| \left( 1 - \frac{|T|}{|S|} \right) \tag{12}$$

$$\mathrm{Var}(\hat{Y}_T^{DAX} - Y_T) = \alpha |S| \left( \frac{|T|}{|S|} - \frac{x_T}{x_S} \right)^2 + \beta x_T \left( 1 - \frac{x_T}{x_S} \right) + \alpha |T| \left( 1 - \frac{|T|}{|S|} \right) \tag{13}$$

First note that the two biases have opposite signs: in other words, if the areal weighting
interpolation method underestimates then the dasymetric method overestimates, and vice
versa. This fact can be interpreted as follows: while the true intensity comprises two
effects, each of these methods accounts for only one of them, which causes the contrast.
Although the signs of the biases are opposite, their absolute values are both proportional to
$\left| \frac{|T|}{|S|} - \frac{x_T}{x_S} \right|$,
which measures the divergence between the share of the auxiliary information in target
$T$ with respect to $S$ and the share of the area of $T$ with respect to $S$. This divergence is
also proportional to $\frac{x_T}{|T|} - \frac{x_S}{|S|}$ and hence can be viewed as a distance to proportionality
between area and auxiliary information. The bias of the areal interpolation method, with
its assumption of homogeneity, does not depend on the areal effect $\alpha |S|$ but is proportional
to the ignored auxiliary information effect; conversely, the bias of the dasymetric method, which
focuses on the effect of the auxiliary information, is free of $\beta x_S$ but is
proportional to the ignored areal effect. We will build on this to propose a new method
in the next section.

The two variances have a common part
$\beta x_T \left( 1 - \frac{x_T}{x_S} \right) + \alpha |T| \left( 1 - \frac{|T|}{|S|} \right)$, which we can interpret
as the loss of information when transferring data from a large source zone to a smaller
target zone. For the remaining part, the same explanation as for the bias holds. Both
variances have a parabola shape with respect to $x_T$ (respectively to $|T|$) with a maximum
at $x_T = \frac{1}{2} x_S$ (resp. $|T| = \frac{1}{2} |S|$): we can say loosely that the variances are maximum
when the target zone is around a half of the source. They vanish when the target zone
is either empty or coincides with the source, which makes sense. The reallocation to a
larger target intuitively decreases the difficulty of the disaggregation problem, except that
the error also depends on the expected number of points, so we should turn attention to
relative error. If one divides the variances by the square of the expected number of points
in the target zone $E(Y_T)$, we can see that the relative error will tend to zero as $E(Y_T)$
tends to infinity.



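Since the dasymetric method is pycnophylactic, the bias at source level is zero. Lemma
4.2 reports the expression of the prediction variances in model AIM for areal weighting
interpolation and dasymetric interpolation at source level.

Lemma 4.2. In model AIM, the variances of areal weighting and dasymetric methods at
the source level are

$$\mathrm{Var}_S^{DAW} = \beta x_S \sum_T \left( \frac{|T|}{|S|} - \frac{x_T}{x_S} \right)^2 + \beta x_S \left( 1 - \sum_T \frac{x_T^2}{x_S^2} \right) + \alpha |S| \left( 1 - \sum_T \frac{|T|^2}{|S|^2} \right) \tag{14}$$

$$\mathrm{Var}_S^{DAX} = \alpha |S| \sum_T \left( \frac{|T|}{|S|} - \frac{x_T}{x_S} \right)^2 + \beta x_S \left( 1 - \sum_T \frac{x_T^2}{x_S^2} \right) + \alpha |S| \left( 1 - \sum_T \frac{|T|^2}{|S|^2} \right) \tag{15}$$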

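To get an insight into the impact of the number $n_T$ of target zones, we consider the special
case where all targets have the same size. In this case, $\frac{|T|}{|S|} = \frac{1}{n_T}$ for any $T$, and we get

$$\mathrm{Var}_S^{DAW} = \left( 1 - \frac{1}{n_T} \right) (\alpha |S| + \beta x_S)$$

$$\mathrm{Var}_S^{DAX} = \left( 1 - \frac{1}{n_T} \right) (\alpha |S| + \beta x_S) + \left( \sum_T \frac{x_T^2}{x_S^2} - \frac{1}{n_T} \right) (\alpha |S| - \beta x_S)$$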

It is obvious that the larger the number of the target zones, the larger the variances, 
which agrees with our conclusion concerning the size of targets. Indeed, when the area of 
the target zones gets smaller, the error on each target decreases but the total error at the 
source level gets larger due to the effect of the number of the targets. 

We are now ready to compute the mean square error difference between the two methods.
We introduce the following quantities, which quantify the relative contribution of the
corresponding effect to the overall mean at the geographical level of a subregion $A$:

$$I_A(X) = \frac{\beta x_A}{\alpha |A| + \beta x_A}$$

$I_A(X)$ is the relative contribution of variable $X$ and similarly
$I_A(|.|) = \frac{\alpha |A|}{\alpha |A| + \beta x_A}$ is the
relative contribution of the areal effect.

The imbalance between the two effects is measured by the difference

$$\Delta_A = I_A(|.|) - I_A(X) = \frac{\alpha |A| - \beta x_A}{E(Y_A)}.$$

This quantity ranges between $-1$ when there is a pure $X$ effect and $1$ when there is a
pure areal effect, with a value of zero when the two effects are of equal size.

We can derive from Lemmas 4.1 and 4.2 the expression of the absolute and relative errors
of the two methods at source level as a function of the relative contribution terms.

Theorem 4.3.

$$\mathrm{Er}_S^{DAW} = I_S(X)^2 E(Y_S)^2 D + I_S(X) E(Y_S) (D + B) + I_S(|.|) E(Y_S) C \tag{16}$$

$$(\mathrm{Re}_S^{DAW})^2 = I_S(X)^2 D + \frac{1}{E(Y_S)} \left[ I_S(X) (D + B - C) + C \right] \tag{17}$$

$$(\mathrm{Re}_S^{DAX})^2 = I_S(|.|)^2 D + \frac{1}{E(Y_S)} \left[ I_S(|.|) (D + C - B) + B \right] \tag{18}$$

where $D = \sum_T \left( \frac{|T|}{|S|} - \frac{x_T}{x_S} \right)^2$, $B = 1 - \sum_T \frac{x_T^2}{x_S^2}$ and
$C = 1 - \sum_T \frac{|T|^2}{|S|^2}$ are nonnegative.

Note that $B$, $C$ and $D$ only depend on the geometry of the problem and the auxiliary
information, whereas the relative contribution terms and $E(Y_S)$ depend on the coefficients
$\alpha$ and $\beta$. It is interesting to mention the symmetry between the two methods, which appears
clearly in these formulas when we exchange the two contribution terms. One can derive
from this theorem the difference between the relative errors of the two methods


$$(\mathrm{Re}_S^{DAW})^2 - (\mathrm{Re}_S^{DAX})^2 = -D \Delta_S \left( 1 + \frac{1}{E(Y_S)} \right), \tag{19}$$

which turns out to be clearly proportional to the imbalance term $\Delta_S$. Similarly, one can
approximate the ratio of the two relative errors when $E(Y_S)$ is large, on the target $A = T$
and on the source $A = S$, by

$$\frac{\mathrm{Re}_A^{DAW}}{\mathrm{Re}_A^{DAX}} \approx \frac{I_A(X)}{I_A(|.|)}. \tag{20}$$

This ratio roughly ranges from 0, in the extreme case of a pure areal effect, to $+\infty$, in the
extreme case of a pure $X$ effect, showing that one can outperform the other by a large amount.
Let us now turn attention to the difference between the two errors.


Theorem 4.4. The difference between the errors of areal weighting and dasymetric methods
on a target zone $T$ is

$$\mathrm{Er}_T^{DAW} - \mathrm{Er}_T^{DAX} = -\left( \frac{|T|}{|S|} - \frac{x_T}{x_S} \right)^2 \Delta_S E(Y_S) (E(Y_S) + 1).$$

The important conclusion of this result is that the sign of the difference in error is determined
by the sign of $\Delta_S$, i.e. the sign of $(\alpha |S| - \beta x_S)$: areal weighting has the smaller
error when $\Delta_S > 0$ and the dasymetric method when $\Delta_S < 0$, in agreement with (19).
Moreover, the stronger the effect of
the auxiliary information $I_S(X)$ is, the better the dasymetric method and the larger the
difference between the two methods.

This result leads to a very interesting consequence: if one of the two effects
dominates on a given source, the related method wins on all target zones belonging to the
source. It also shows that the two methods will have the same accuracy if the two effects are
balanced or the auxiliary variable is homogeneous.

The normalized difference between the two effects $\Delta_S$ clearly determines which method
is best.
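
The role of the imbalance can be illustrated by computing the two target-level mean square errors directly from the bias and variance of the predictors $w Y_S$, with $w = |T|/|S|$ for areal weighting and $w = x_T/x_S$ for dasymetric. The sketch below is an illustration only; the function names and the numerical values are assumptions.

```python
def imbalance(alpha, beta, area, x):
    """Delta_A = (alpha |A| - beta x_A) / E(Y_A)."""
    return (alpha * area - beta * x) / (alpha * area + beta * x)

def mse(w, lam_t, lam_s):
    """Mean square error of the predictor w * Y_S of Y_T (squared bias plus variance)."""
    bias = w * lam_s - lam_t
    var = (w - 1.0) ** 2 * lam_t + w ** 2 * (lam_s - lam_t)
    return bias ** 2 + var

def error_gap(alpha, beta, area_t, area_s, x_t, x_s):
    """Er_T^DAW - Er_T^DAX computed directly from the two predictors."""
    lam_t = alpha * area_t + beta * x_t
    lam_s = alpha * area_s + beta * x_s
    return mse(area_t / area_s, lam_t, lam_s) - mse(x_t / x_s, lam_t, lam_s)

def closed_form_gap(alpha, beta, area_t, area_s, x_t, x_s):
    """-(|T|/|S| - x_T/x_S)^2 * Delta_S * E(Y_S) * (E(Y_S) + 1)."""
    lam_s = alpha * area_s + beta * x_s
    d = (area_t / area_s - x_t / x_s) ** 2
    return -d * imbalance(alpha, beta, area_s, x_s) * lam_s * (lam_s + 1.0)

geometry = (3.0, 10.0, 40.0, 60.0)             # hypothetical |T|, |S|, x_T, x_S
gap_areal = error_gap(8.0, 0.1, *geometry)     # areal effect dominates, Delta_S > 0
gap_aux = error_gap(0.0, 1.0, *geometry)       # pure X effect, Delta_S = -1
gap_balanced = error_gap(3.0, 0.5, *geometry)  # alpha |S| = beta x_S, Delta_S = 0
```

In this example the gap is negative when the areal effect dominates (areal weighting wins), positive under a pure $X$ effect (dasymetric wins) and zero when the two effects are balanced.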

At this point, it seems natural to look for a linear combination of these two predictors 


$$\hat{Y}_T(w) = w \hat{Y}_T^{DAW} + (1 - w) \hat{Y}_T^{DAX} \tag{21}$$


which would combine their good properties. It turns out that in the class of linear 
combinations of areal weighting and dasymetric predictors, the best predictor is given 
by the following theorem 

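Theorem 4.5. In model AIM, the best predictor in the sense of minimizing (with respect
to the weight $w$) the errors on any target zone $T$ in the class (21) is

$$\hat{Y}_T^{C} = \hat{Y}_T(w^*) = \frac{\alpha |T| + \beta x_T}{\alpha |S| + \beta x_S} Y_S \quad \text{for} \quad w^* = \frac{\alpha |S|}{\alpha |S| + \beta x_S}. \tag{22}$$

Its error and relative error are respectively given by

$$\mathrm{Er}_T^{C} = \frac{\lambda_T (\lambda_S - \lambda_T)}{\lambda_S} \tag{23}$$

$$(\mathrm{Re}_S^{C})^2 = \frac{1}{4 E(Y_S)} \left[ \Delta_S^2 D + 2 \Delta_S (C - B) + D + 2B + 2C \right], \tag{24}$$

where $\lambda_A = E(Y_A) = \alpha |A| + \beta x_A$ denotes the expected count on a zone $A$.

Moreover, this predictor coincides with the best linear unbiased predictor in model AIM.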
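
The optimal weight can be checked on a small example: the combination with weight $w^*$ allocates to $T$ the share $E(Y_T)/E(Y_S)$ of the source count, and moving the weight away from $w^*$ increases the mean square error. The sketch below is illustrative; the function names and the parameter values are assumptions.

```python
def best_weight(alpha, beta, area_s, x_s):
    """Weight w* put on areal weighting: the areal share of the expected source count."""
    return alpha * area_s / (alpha * area_s + beta * x_s)

def combined_share(w, area_t, area_s, x_t, x_s):
    """Share of Y_S allocated to T by the combination w * DAW + (1 - w) * DAX."""
    return w * area_t / area_s + (1.0 - w) * x_t / x_s

def mse_of_share(c, lam_t, lam_s):
    """Mean square error of the predictor c * Y_S of Y_T (squared bias plus variance)."""
    bias = c * lam_s - lam_t
    var = (c - 1.0) ** 2 * lam_t + c ** 2 * (lam_s - lam_t)
    return bias ** 2 + var

alpha, beta = 2.0, 0.5                      # hypothetical model coefficients
area_t, area_s, x_t, x_s = 3.0, 10.0, 40.0, 60.0
lam_t = alpha * area_t + beta * x_t         # 26
lam_s = alpha * area_s + beta * x_s         # 50

w_star = best_weight(alpha, beta, area_s, x_s)
share = combined_share(w_star, area_t, area_s, x_t, x_s)
best_error = mse_of_share(share, lam_t, lam_s)
nearby_errors = [mse_of_share(combined_share(w_star + step, area_t, area_s, x_t, x_s),
                              lam_t, lam_s) for step in (-0.1, 0.1)]
```

Here $w^* = 0.4$, the allocated share is $26/50$, and the resulting error equals $26 \times 24 / 50 = 12.48$, the value of $\lambda_T (\lambda_S - \lambda_T) / \lambda_S$.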

Because

$$\frac{\lambda_T (\lambda_S - \lambda_T)}{\lambda_S} = \mathrm{Var}(\hat{Y}_T^{DAX} - Y_T) - \lambda_S \left( \frac{x_T}{x_S} - \frac{\lambda_T}{\lambda_S} \right)^2 = \mathrm{Var}(\hat{Y}_T^{DAW} - Y_T) - \lambda_S \left( \frac{|T|}{|S|} - \frac{\lambda_T}{\lambda_S} \right)^2,$$

the prediction error of the best predictor is smaller than the variances of the other
two methods, and the gap is larger the further the auxiliary information is
from homogeneity. Of course, $\hat{Y}_T^C$ is not a feasible predictor since it depends on
the unknown coefficients $\alpha$ and $\beta$ of model AIM, but we will use it as a benchmark tool
and we will relate it later on to the regression predictor. If we look at
the error at the level of source $S$, we have that

$$\mathrm{Er}_S^C = \lambda_S \left( 1 - \sum_T \frac{\lambda_T^2}{\lambda_S^2} \right) \leq \lambda_S \left( 1 - \frac{1}{n_T(S)} \right),$$

where $n_T(S)$ is the number of targets in source $S$, and hence this predictor's accuracy is worst
when all targets have the same expected number of points $\frac{\lambda_S}{n_T(S)}$. It is interesting to
note that the relative error (at source level $S$) of the best predictor tends to zero as the
expected number of points in the source $S$ tends to infinity, which was not the case for
the dasymetric methods. For a fixed expected number of points in a given source $S$, we
can easily find the value of the imbalance $\Delta_S$ which minimizes the relative error of $\hat{Y}_T^C$,

$$\Delta^* = \frac{B - C}{D} = \frac{\sum_T \frac{|T|^2}{|S|^2} - \sum_T \frac{x_T^2}{x_S^2}}{\sum_T \left( \frac{|T|}{|S|} - \frac{x_T}{x_S} \right)^2},$$

and thus derive a lower bound for the relative error for a given geometry.

Because intuitively, it is natural to think that areal weighting should be outperformed by 
dasymetric when the underlying process is inhomogeneous, we consider the two cases of 
homogeneous and piecewise homogeneous submodels. 


4.2 Homogeneous model 

Areal weighting interpolation is a simple and natural rule which is based on the assumption
that the target variable is homogeneous at source level. Indeed in model AIM, it
is equivalent to assuming that the point process is homogeneous and its intensity is therefore
constant (equal to $\alpha > 0$), leading to:

$$Y_A \sim \mathcal{P}(\alpha |A|).$$

Substituting $\beta = 0$ in (10), (12), (14), we get the bias, variance and error in this case:

$$E(\hat{Y}_T^{DAW} - Y_T) = 0$$

$$\mathrm{Er}_T^{DAW} = \mathrm{Var}(\hat{Y}_T^{DAW} - Y_T) = \alpha |T| \left( 1 - \frac{|T|}{|S|} \right)$$

$$\mathrm{Er}_S^{DAW} = \mathrm{Var}_S^{DAW} = \alpha |S| \left( 1 - \sum_T \frac{|T|^2}{|S|^2} \right)$$

Since $\frac{1}{n_T} \leq \sum_T \frac{|T|^2}{|S|^2} \leq 1$, the error at source level is maximum when all target zones
have the same size, and minimal when there is a unique target which coincides with the
source.


Substituting $\beta = 0$ in (22) leads to the conclusion that the best linear unbiased
predictor in the homogeneous AIM model is given by the areal weighting method, which
is a natural result. Let us now turn attention to a very simple non homogeneous model to
illustrate the intuitive fact that the areal weighting interpolation method is not the best
choice in a non homogeneous situation.


4.3 Piecewise homogeneous model

Suppose the source zone $S$ comprises two homogeneous subzones $C_1$ and $C_2$, called
control zones, with intensities $\alpha_1$ and $\alpha_2$ respectively. In this case, we get

$$Y_A \sim \mathcal{P}(\alpha_i |A|)$$

where $A \subset C_i$, with $i = 1, 2$. For simplicity, we assume the target zones to be
nested within the control zones. The results of Lemmas 4.1 and 4.2 give in this case


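$$E(\hat{Y}_T^{DAW} - Y_T)_{T : T \subset C_1} = \frac{|T|}{|S|} (\alpha_2 - \alpha_1) |C_2|$$

$$E(\hat{Y}_T^{DAW} - Y_T)_{T : T \subset C_2} = \frac{|T|}{|S|} (\alpha_1 - \alpha_2) |C_1|$$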
Varg AW = ct 1 |C r 1 |(l — £ ®)+a a |C 2 |(l- £ S) 

T:TcCi I I T:TcC 2 ' ' 

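$$\mathrm{Er}_S^{DAW} = \mathrm{Var}_S^{DAW} + \sum_{T : T \subset C_1} \frac{|T|^2}{|S|^2} (\alpha_2 - \alpha_1)^2 |C_2|^2 + \sum_{T : T \subset C_2} \frac{|T|^2}{|S|^2} (\alpha_2 - \alpha_1)^2 |C_1|^2$$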

The variance has a similar structure to the one of the homogeneous model. The bias 
clearly shows that the difference between the two intensities of the subzones will drive the 
size of the error. 


4.4 Toy example 

In order to illustrate our findings, we use a simulated toy example. We intentionally drop the assumption that targets are nested in sources, which was made for mathematical convenience; this allows us to test the robustness of the results with respect to that assumption. On a square grid with 25 cells, we design three sources and seven targets as unions of cells. Figure 1 shows the design of sources and targets together with the cell counts for two target variables $Y_1$ and $Y_2$ and one auxiliary variable $X$ (for one particular draw). To generate $X$, we simulate a Poisson point process with an inhomogeneous intensity. We then recover the counts at the cell level to get the auxiliary information. The two target variables are then generated according to their relationship with the auxiliary variable (model (2)) and the source values are obtained by aggregation of cells. The true values of the two target variables are also shown at target level, for one particular draw, for accuracy comparison. For $Y_1$, we use the parameters $\alpha = 80$ and $\beta = 1$, and for $Y_2$ we use $\alpha = 0$ and $\beta = 1$, so that the area has a strong impact on $Y_1$ and $Y_2$ is only driven by $X$. Conditionally upon one draw of $X$ (for which we observe 1011 points), we draw 1000 repetitions of $Y_1$ and $Y_2$.
(e) $Y_1$, $Y_2$ at sources (f) $X$ at intersections (g) $Y_1$, $Y_2$ at targets
Figure 1: Toy example: data at source, target and intersection zones.

The accuracy criterion is an average of the error over all 1000 draws. We present the relative error and the error at target level in Figure 2.
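The two proportional weighting predictors compared in this example can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the numbers are not those of the toy example, and the targets are assumed to partition a single source.

```python
import numpy as np


def areal_weighting(y_source, target_areas):
    """DAW: share the source count in proportion to target areas |T| / |S|."""
    target_areas = np.asarray(target_areas, dtype=float)
    return target_areas / target_areas.sum() * y_source


def dasymetric(y_source, target_aux):
    """DAX: share the source count in proportion to auxiliary counts x_T / x_S."""
    target_aux = np.asarray(target_aux, dtype=float)
    return target_aux / target_aux.sum() * y_source


# Hypothetical source with count 200, split into three targets.
areas = [2.0, 3.0, 5.0]
aux_counts = [10, 30, 60]
daw = areal_weighting(200, areas)     # shares 0.2, 0.3, 0.5
dax = dasymetric(200, aux_counts)     # shares 0.1, 0.3, 0.6
```

Both predictors return values that add up to the source count, which is the pycnophylactic property at source level.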







(a) DAW for $Y_1$ (b) DAX for $Y_1$ (c) DAW for $Y_2$ (d) DAX for $Y_2$

Figure 2: Toy example. Comparison of areal weighting interpolation and dasymetric method.


Figure 2 is in agreement with the theoretical results: areal weighting interpolation is better than dasymetric for $Y_1$, for which the areal effect dominates, and worse for $Y_2$, for which the auxiliary information effect driven by $X$ dominates. We can see in Figure 1 (a) and (c) that $Y_2$ is less homogeneous than $Y_1$: the bottom-right target zone $T_5$ has very small $Y_2$ counts. The dasymetric predictor is therefore very good for $Y_2$ (Figure 2 (d)). To be more precise, we compute the ratio of the two average errors at target level for the two methods: areal weighting is best for $Y_1$, with a ratio of square roots of errors of $\frac{201}{452} = 44\%$, whereas dasymetric is best for $Y_2$, with a ratio of square roots of errors of $\frac{26}{197} = 13\%$. Table 1 reports the square root of the overall regional error (from formula (5)) for the three methods: areal weighting, dasymetric and Poisson regression. For $Y_1$, the regression method is best; for $Y_2$, dasymetric is best because the impact of $X$ is strong (almost no areal effect).

Table 1: Square root of overall regional errors

Methods                          $Y_1$   $Y_2$
Areal weighting interpolation    201     197
Dasymetric                       452     26
Regression                       55      33

For the practitioner, an important question is how to guess which method will perform better in a given situation. One might believe that a strong correlation between $Y$ and $X$ is a sign that dasymetric based on $X$ will perform better than areal weighting. However, in our case the correlation between $Y_1$ and $X$ is 0.94 and the correlation between $Y_2$ and $X$ is 0.98, which shows that relying on correlation is a bad idea. We could look at a measure of homogeneity to predict that areal weighting is the best method: in our case, the Gini coefficient of $Y_1$ is 0.14 and that of $Y_2$ is 0.40, which goes in that direction. As we have seen in Theorem 4.4, the sign of the imbalance index at source level $\Delta_S$ determines which method is best (see Table 2).
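The two diagnostics mentioned here can be sketched in a few lines. Two assumptions are made and should be checked against the definitions given earlier in the paper: the Gini coefficient is computed with one common rank-based formula, and the imbalance index is taken to be the normalized difference between the areal effect $\alpha|S|$ and the auxiliary effect $\beta x_S$. Function names and inputs are hypothetical.

```python
import numpy as np


def gini(counts):
    """Gini coefficient of non-negative counts (rank-based formula, assumed variant)."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * x) / (n * x.sum()) - (n + 1.0) / n


def imbalance(alpha, area, beta, x):
    """Assumed imbalance index: (alpha|S| - beta x_S) / (alpha|S| + beta x_S)."""
    areal_effect = alpha * area
    auxiliary_effect = beta * x
    return (areal_effect - auxiliary_effect) / (areal_effect + auxiliary_effect)


uniform_gini = gini([1, 1, 1, 1])             # perfectly homogeneous counts
concentrated_gini = gini([0, 0, 0, 4])        # all points in one cell
no_areal_effect = imbalance(0.0, 8.0, 1.0, 500.0)
```

Under this assumed definition, a variable with $\alpha = 0$ has an imbalance of $-1$ on every source, which agrees with the row for $Y_2$ in Table 2, and a positive value indicates a dominant areal effect.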

To be more precise, let us examine the results of Table 3 in comparison with Theorem 4.4. Theorem 4.4 shows that the difference of errors at target level is influenced by three factors: the mean number of points of the source, the imbalance of the source and the inhomogeneity of the auxiliary information of the given target. Let us look at each impact. For the influence of the inhomogeneity of the auxiliary variable, we compare the targets of one source, $S_1$ for example. The first two factors are constant (1164 points and 0.10 imbalance), and we see that the more inhomogeneous the auxiliary information is (in increasing order $A_{16}$, $A_{13}$, $A_{17}$, $A_{12}$), the more distant the two methods are (18, 20, 1728 and 2529 respectively). To examine the impact of the imbalance, we consider the intersection zones $A_{17}$ and $A_{33}$, which are nested within two sources with similar numbers of points ($S_1$ with 1164 points and $S_3$ with 1168 points respectively). Even though their inhomogeneities are not very different (0.12 vs 0.10), the difference of the errors on $A_{33}$ is 3.7 times larger than on $A_{17}$. This is explained by the gap between the imbalances of $S_1$ and $S_3$ (0.10 vs 0.51). Last but not least is the effect of the number of points of the source zones. The comparison between $A_{24}$ and $A_{35}$ shows its impact: they have similar inhomogeneities (0.20 vs 0.19) and not too different imbalances (0.41 vs 0.51), but the difference in errors for $A_{35}$ (25707) is three times larger than the difference in errors for $A_{24}$ (7938), and this is linked to the discrepancy in the mean numbers of points (1168 vs 679). As Table 3 shows, the combination of the three impacts is very complex. In other words, choosing between areal weighting interpolation and the dasymetric method is not an easy problem.

Table 2: Imbalance index at source level

Source zones   $S_1$   $S_2$   $S_3$
$Y_1$          0.10    0.41    0.51
$Y_2$          -1      -1      -1


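Source   $E(Y_1)$   Imbalance   Intersection   Inhomogeneity $\left|\frac{|T|}{|S|} - \frac{x_T}{x_S}\right|$   $\mathrm{Er}_T^{DAX} - \mathrm{Er}_T^{DAW}$   $\mathrm{Re}_T^{DAX} / \mathrm{Re}_T^{DAW}$
$S_1$    1164       0.10        $A_{12}$       0.14    2529     1.5
$S_1$    1164       0.10        $A_{13}$       0.01    20       1.3
$S_1$    1164       0.10        $A_{16}$       0.01    18       1.1
$S_1$    1164       0.10        $A_{17}$       0.12    1728     1.4
$S_2$    679        0.41        $A_{21}$       0.03    135      1.8
$S_2$    679        0.41        $A_{24}$       0.20    7938     5.4
$S_2$    679        0.41        $A_{25}$       0.18    5999     5.4
$S_3$    1168       0.51        $A_{31}$       0.02    199      3.1
$S_3$    1168       0.51        $A_{32}$       0.17    19858    8.8
$S_3$    1168       0.51        $A_{33}$       0.10    6435     7.5
$S_3$    1168       0.51        $A_{34}$       0.06    16817    8.7
$S_3$    1168       0.51        $A_{35}$       0.19    25707    [illegible]
$S_3$    1168       0.51        $A_{37}$       0.09    2240     6.0

Table 3: Errors at intersection level for $Y_1$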


5 Relative accuracy of the other methods: asymptotic assessment


Let us now try to extend the comparison to the Poisson regression method. This can no longer be done by finite distance methods, so we introduce an asymptotic framework. Model (2) yields at source level

$$Y_s \sim \mathcal{P}(\alpha|S_s| + \beta x_s) \qquad (25)$$

where $x_s = \sum_{t: t \cap s \neq \emptyset} x_{st}$. Besides the Poisson regression predictor defined by (3), and inspired by Theorem 4.5, we propose a new predictor, called the scaled Poisson regression predictor, defined as follows

$$\hat{Y}_{st}^{ScR} = \frac{\hat\alpha|A_{st}| + \hat\beta x_{st}}{\hat\alpha|S_s| + \hat\beta x_s}\, Y_s \qquad (26)$$

where $\hat\alpha$ and $\hat\beta$ are the estimators of $\alpha$ and $\beta$ obtained through the Poisson regression at source level. Note that if the model contains only one of the two effects (that of $X$ for example), then it is easy to see that the predictor of the scaled regression method coincides with the dasymetric method (corresponding to $X$):

$$\hat{Y}_T^{ScR} = \frac{\hat\beta x_T}{\hat\beta x_S}\, Y_S = \frac{x_T}{x_S}\, Y_S = \hat{Y}_T^{DAX}$$

In section 5.1 we establish the asymptotic properties of the estimators $\hat\alpha$ and $\hat\beta$, and these results will enable us to compare the predictors in section 5.2. Section 5.3 illustrates these results on a toy example.
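A minimal sketch of the scaled Poisson regression predictor may help fix ideas. It assumes that the targets partition a single source, so that the fitted target means add up to the fitted source mean, and it takes the coefficient estimates as given. The function name and all numerical values are hypothetical.

```python
import numpy as np


def scaled_poisson_regression(y_source, areas, aux, alpha_hat, beta_hat):
    """Rescale fitted target means so that they sum to the observed source count."""
    areas = np.asarray(areas, dtype=float)
    aux = np.asarray(aux, dtype=float)
    fitted = alpha_hat * areas + beta_hat * aux   # alpha|A_st| + beta x_st per target
    return fitted / fitted.sum() * y_source       # divide by alpha|S_s| + beta x_s


# Hypothetical source with count 200 split into three targets.
pred = scaled_poisson_regression(200, [2, 3, 5], [10, 30, 60], alpha_hat=4.0, beta_hat=1.0)
# With no areal effect the predictor reduces to the dasymetric shares x_T / x_S.
pred_aux_only = scaled_poisson_regression(200, [2, 3, 5], [10, 30, 60], alpha_hat=0.0, beta_hat=1.0)
```

The predictions always sum to the source count, whatever the estimates, which is why the scaling enforces the pycnophylactic property at source level.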


5.1 Estimators of the regression coefficients 

In this section, we adapt proofs from Fahrmeir and Kaufmann (1985) to establish the consistency and asymptotic normality of the estimators $\hat\alpha$, $\hat\beta$. We first need to describe an asymptotic framework. To be realistic, we assume that the whole region $\Omega$ is fixed and that the number of source zones $n_S$ (hereafter denoted by $n$) increases to infinity. In this section, the source zones will be denoted by $S_{n,i}$, $i = 1, 2, \ldots, n$, and $\Omega = \bigcup_i S_{n,i}$. Because of the extensive property of $X$, we also assume a similar property of $x_{n,i}$: the total auxiliary information on the region remains constant, $x_\Omega = \sum_i x_{n,i}$. In order to get a consistent regression, however, we need the amount of information at source level to increase, and we thus assume that the intensity of $Y$ increases with a rate $k_n \to \infty$ so that

$$Y_A \sim \mathcal{P}(\alpha|\tilde{A}| + \beta \tilde{x}_A)$$

where $|\tilde{A}| = k_n |A|$, $\tilde{x}_A = k_n x_A$.

Let $\gamma = (\alpha, \beta)'$, $Z_A = (|A|, x_A)'$ and $Z_{n,i} = (|\tilde{S}_{n,i}|, \tilde{x}_{n,i})'$. With these notations we have $\lambda_A = \gamma' Z_A$, $\tilde{Z}_A = (|\tilde{A}|, \tilde{x}_A)' = k_n Z_A$ and $Y_A \sim \mathcal{P}(k_n \lambda_A)$. The true value of the parameter $\gamma$ will be denoted by $\gamma_0 = (\alpha_0, \beta_0)'$.

The log likelihood function $l_n(\gamma)$, the score function $s_n(\gamma)$ and the information matrix $F_n(\gamma)$ are then given by

$$l_n(\gamma) = \sum_{i=1}^{n} \left( y_{n,i} \ln(\gamma' Z_{n,i}) - \gamma' Z_{n,i} - \ln(y_{n,i}!) \right)$$

$$s_n(\gamma) = \frac{\partial l_n(\gamma)}{\partial \gamma} = \sum_{i=1}^{n} \frac{Z_{n,i}}{\gamma' Z_{n,i}} \left( y_{n,i} - \gamma' Z_{n,i} \right)$$

$$F_n(\gamma) = \mathrm{Cov}_\gamma(s_n(\gamma)) = \sum_{i=1}^{n} \frac{Z_{n,i} Z_{n,i}'}{\gamma' Z_{n,i}}$$

Differentiation of the score yields

$$H_n(\gamma) = -\frac{\partial s_n(\gamma)}{\partial \gamma'} = \sum_{i=1}^{n} \frac{Z_{n,i} Z_{n,i}'}{(\gamma' Z_{n,i})^2}\, y_{n,i}$$

It is easy to see that $E_\gamma(s_n(\gamma)) = 0$ and $E_\gamma(H_n(\gamma)) = F_n(\gamma)$. We further simplify the notations and use $s_n$, $F_n$, $H_n$, $E$ instead of $s_n(\gamma_0)$, $F_n(\gamma_0)$, $H_n(\gamma_0)$, $E_{\gamma_0}$. It is clear that the matrix $H_n$ is positive definite and therefore the log likelihood function is concave, which leads to a unique maximum. In the sequel, we also need the square root $F_n^{1/2}$ of the symmetric matrix $F_n$, i.e. $F_n^{1/2} F_n^{T/2} = F_n$.
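The quantities above translate directly into code. The following is a sketch under stated assumptions rather than the estimation procedure used in the paper: it implements the log likelihood, score and information matrix of a Poisson regression with identity link (mean $\gamma' Z_{n,i}$), and maximizes the likelihood by Fisher scoring, which is one standard choice. The function names, the starting value and the data are hypothetical, and no safeguard is included against fitted means becoming non-positive during the iterations.

```python
from math import lgamma

import numpy as np


def log_likelihood(gamma, Z, y):
    """l_n(gamma) = sum_i [ y_i ln(gamma'Z_i) - gamma'Z_i - ln(y_i!) ]."""
    mu = Z @ gamma
    log_factorials = sum(lgamma(float(v) + 1.0) for v in y)
    return float(np.sum(y * np.log(mu) - mu) - log_factorials)


def score(gamma, Z, y):
    """s_n(gamma) = sum_i Z_i (y_i - gamma'Z_i) / (gamma'Z_i)."""
    mu = Z @ gamma
    return Z.T @ (y / mu - 1.0)


def information(gamma, Z):
    """F_n(gamma) = sum_i Z_i Z_i' / (gamma'Z_i)."""
    mu = Z @ gamma
    return (Z / mu[:, None]).T @ Z


def fisher_scoring(Z, y, gamma_start, n_iter=25):
    """Iterate gamma <- gamma + F_n(gamma)^{-1} s_n(gamma)."""
    gamma = np.asarray(gamma_start, dtype=float)
    for _ in range(n_iter):
        gamma = gamma + np.linalg.solve(information(gamma, Z), score(gamma, Z, y))
    return gamma


# Hypothetical design: columns are (area, auxiliary count) for four source zones.
Z = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
gamma_true = np.array([2.0, 3.0])
y = Z @ gamma_true            # noiseless "counts", so the score vanishes at gamma_true
gamma_hat = fisher_scoring(Z, y, gamma_start=[1.0, 1.0])
```

With noiseless data the score is zero at the generating parameter, so the iteration recovers it exactly; with real counts the same routine would return the maximum likelihood estimate provided the fitted means stay positive.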

Our asymptotic framework differs from that of Fahrmeir and Kaufmann (1985) in the 
sense that at each step they have one new observation whereas in our case at each step 
all observations are new and we have one more than at the previous step. For this reason, 
we modify slightly their conditions and assume that 

(C1) $Z_{n,i} \in \mathcal{Z}$ for all $n, i$, where $\mathcal{Z}$ is a compact set.

(C2) $\lambda_{\min}\left(\sum_i Z_{n,i} Z_{n,i}'\right) \to \infty$ as $n \to \infty$, where $\lambda_{\min}(W)$ denotes the minimum eigenvalue of the matrix $W$.

Condition (C1) is satisfied if there exist two positive numbers $c_1$, $c_2$ (note that $\|Z_{n,i}\| \neq 0$) s.t.

$$c_1 < \|Z_{n,i}\| < c_2 \qquad (27)$$

In that case, the number of source zones grows at a rate similar to the rate of growth of the intensity, and the number of points in one source zone is quite stable during the change process.

Under these conditions, we get the following asymptotic behavior for the Poisson regression coefficients.

Theorem 5.1. Under conditions (C1) and (C2), the following statements hold for the Poisson regression estimator $\hat\gamma_n$ of $\gamma$:

(i) $\hat\gamma_n \to_P \gamma_0$ (weak consistency)

(ii) $F_n^{T/2}(\hat\gamma_n - \gamma_0) \to_d \mathcal{N}(0, I)$ (asymptotic normality)

In the next section, we use these results to study the asymptotic behavior of the predictors. 
5.2 Predictors


In this section, we consider the asymptotic properties of the following two predictors:
the regression predictor (3) and the scaled regression predictor (26). We prove that
the scaled regression predictor is asymptotically as accurate as the unfeasible composite 
predictor. We also compare these two methods with areal weighting interpolation and 
dasymetric interpolation. 

The first proposition is concerned with the pycnophylactic property, which is of interest 
in the areal interpolation literature. It shows that, whereas it is satisfied by the scaled 
Poisson regression predictor, it is not satisfied at source level by the ordinary Poisson 
regression predictor but only at region level. 


Proposition 5.2. The scaled Poisson regression predictor satisfies the pycnophylactic 
property at source level. The ordinary Poisson regression predictor is pycnophylactic at 
region level and asymptotically pycnophylactic at source level. 


To prove proposition 5.2 we need the following asymptotic normality result for the target 
variables 



~^d m, i) 


(28) 


We now turn attention to the asymptotic behavior of the prediction error for the ordinary 
Poisson regression predictor. 


Theorem 5.3. The asymptotic normality of the prediction error of the Poisson regression 
predictor at source level is given by 


yREG _ y 

1 mi 1 n.i 



—td A/"(0, 1) 


If we also assume a lower bound for Z T , the following similar result at the target level 
holds 


Y* EG - Y t 


-P-d A/"(0,1) 




The next result is about the quadratic prediction error and relative prediction error of 
the Poisson regression predictor. 


Theorem 5.4. For any $\eta > 0$, there exists a sequence of sets $\{Q_i\}_i$ with $P(Q_i) \to 1$ such that

$$-\eta + \gamma_0' Z_T < E\left[(\hat{Y}_T^{REG} - Y_T)^2 \mathbf{1}_{Q_i}\right] < \eta + \gamma_0' Z_T$$

If the number of target zones contained in one source zone $S_{n,i}$ is bounded, the relative error at source level can be approximated by

$$\mathrm{Re}_{n,i}^{REG} \approx \frac{1}{\sqrt{E(Y_{n,i})}} \qquad (29)$$

Because $\mathrm{Var}(Y_T) = E(Y_T) = \gamma_0' Z_T$, this theorem says that the quadratic prediction error of the regression predictor is asymptotically equivalent to the variance of the underlying process. Equation (29) shows that the relative error of the regression predictor is going to be small when the number of points on a source zone is large. However, this number being bounded by condition (C1), this relative error cannot converge to zero in this framework.

Let us now turn attention to the difference between the relative prediction errors of the Poisson regression method and those of the areal weighting and the dasymetric methods. If the target zones are nested within the source zones and the number of target zones contained in one source is bounded, we get the following approximation at source level for the differences between the relative errors of the methods when $E(Y_{n,i})$ is large:


[(R<f G ) 2 

[(R<f G ) 2 


(RC^) 2 ] 
(R C x ) 2 ] 


-(l + A n ,) 2 ^(^- 

j, | ^n, i 


X T n 2 
n,i 

X T y. 

%n,i 


(30) 

(31) 

(32) 


This result shows that, among the three methods (areal weighting, dasymetric and Poisson regression), regression outperforms the other two asymptotically (negative sign). However, from the proof in the annex, we can see that if $\frac{|T|}{|S_{n,i}|} - \frac{x_T}{x_{n,i}} = 0$ then the regression is less accurate than areal weighting and dasymetric asymptotically, so that none of them is always dominant.

For areal weighting and dasymetric predictors, we have seen that if one method is better on one target, then it is also better on all targets contained in the same source zone. The difference between the accuracy of the regression method and that of the other two methods depends on the difference between the ratios $\frac{|T|}{|S_{n,i}|}$ and $\frac{x_T}{x_{n,i}}$: the higher this difference, the larger the gap between regression and the other two.

The fact that the regression predictor does not satisfy the pycnophylactic property is no surprise, but the fact that it does satisfy this property on the whole region is interesting. The idea of scaling to obtain the pycnophylactic property can also be found in Yuan et al. (1997) for ordinary linear regression, without theoretical justification; we have extended it to the Poisson regression case and provided some theoretical motivation for it.
We now turn attention to the scaled regression and prove that it is better than the unscaled one and that its accuracy can be approximated by that of the unfeasible composite predictor. The first lemma proves an asymptotic equivalence between the scaled regression predictor and the unfeasible composite predictor.

Lemma 5.5. For any target $T$,

$$\hat{Y}_T^{ScR} - \hat{Y}_T^{C} \to_P 0 \qquad (33)$$

The next result is about the quadratic prediction error of the scaled Poisson regression predictor.

Theorem 5.6. For any $\eta > 0$, there exists a sequence of sets $\{Q_i\}_i$ with $P(Q_i) \to 1$ such that

$$-\eta + Z_T'\gamma_0 - \frac{(Z_T'\gamma_0)^2}{Z_{n,i}'\gamma_0} < E\left[(\hat{Y}_T^{ScR} - Y_T)^2 \mathbf{1}_{Q_i}\right] < \eta + Z_T'\gamma_0 - \frac{(Z_T'\gamma_0)^2}{Z_{n,i}'\gamma_0}$$

Since $\mathrm{Er}_T^{C} = Z_T'\gamma_0 - \frac{(Z_T'\gamma_0)^2}{Z_{n,i}'\gamma_0}$, this theorem shows that the quadratic prediction error of the scaled regression predictor is asymptotically equivalent to that of the composite predictor. Consequently, the scaled regression method is the best among the areal weighting, the dasymetric and the regression predictors.
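The benchmark error of the composite predictor can be recovered in a few lines. The derivation below is a sketch under an assumption that should be checked against Theorem 4.5: the composite predictor is taken to be $\frac{E(Y_T)}{E(Y_S)} Y_S$ for a target $T$ nested in the source $S = S_{n,i}$. The shorthand $\mu_T$, $\mu_S$ is introduced here only for readability.

```latex
% Assumed notation: \mu_T = Z_T'\gamma_0, \mu_S = Z_{n,i}'\gamma_0, with T nested in S_{n,i}.
% Poisson counts: Var(Y_S) = \mu_S, Var(Y_T) = \mu_T, Cov(Y_S, Y_T) = Var(Y_T) = \mu_T.
\begin{align*}
E\Big(\tfrac{\mu_T}{\mu_S} Y_S - Y_T\Big)
  &= \tfrac{\mu_T}{\mu_S}\,\mu_S - \mu_T = 0, \\
\mathrm{Var}\Big(\tfrac{\mu_T}{\mu_S} Y_S - Y_T\Big)
  &= \tfrac{\mu_T^2}{\mu_S^2}\,\mathrm{Var}(Y_S) + \mathrm{Var}(Y_T)
     - 2\,\tfrac{\mu_T}{\mu_S}\,\mathrm{Cov}(Y_S, Y_T) \\
  &= \tfrac{\mu_T^2}{\mu_S} + \mu_T - 2\,\tfrac{\mu_T^2}{\mu_S}
   = \mu_T - \tfrac{\mu_T^2}{\mu_S}.
\end{align*}
```

Since the bias is zero, the quadratic error equals this variance, which is never larger than $\mu_T$, the asymptotic error of the unscaled regression predictor.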


5.3 Accuracy: simulation assessment with a toy example 

We devise a simple simulation to illustrate these results. On a square region $\Omega$ with 16 x 16 cells, we build three systems of sources with respectively 14, 7 and 4 sources (see Figure 3). We simulate two Poisson point patterns (our auxiliary information) with an expected overall number of points of 100,000: $X_1$ is very inhomogeneous (Gini coefficient of cell counts of 0.74, with 100,247 points) and $X_2$ is very homogeneous (Gini coefficient of cell counts of 0.03, with 100,008 points).

Target variables are then generated following model (2). For each of the auxiliary variables, we choose four pairs of coefficients $\alpha$, $\beta$ to study the effects of imbalance, so that we get eight different target variables. Table 4 reports the minimum, maximum and average imbalance for each case and for each system of source zones.

We then apply the four considered methods (areal weighting, dasymetric, Poisson regression and scaled Poisson regression) to transfer the data from each of the three systems of source zones to cell level. For each case, we generate the data 1000 times, and calculate prediction errors for each method and each iteration. Table 5 (respectively Table 6) reports the average absolute square root of prediction errors (respectively the average absolute square root of relative prediction errors). The two tables also present the mean of the target variables at region level $E(Y_\Omega) = \alpha|\Omega| + \beta x_\Omega$ (because it appears in Theorem 5.4) and the theoretical composite prediction error as a benchmark (see Theorem 5.6).
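The regional means used as benchmarks can be recomputed from the simulation settings. The sketch below assumes that each of the 16 x 16 cells has unit area, so that $|\Omega| = 256$; the variable and function names are hypothetical.

```python
from math import sqrt

N_CELLS = 16 * 16                                   # |Omega|, assuming unit-area cells
X_TOTAL = {"X1": 100_247, "X2": 100_008}            # simulated totals x_Omega
CASES = [(100, 1.0), (600, 1.0), (1000, 1.0), (1000, 0.1)]   # (alpha, beta) pairs


def regional_mean(alpha, beta, x_total, area=N_CELLS):
    """E(Y_Omega) = alpha |Omega| + beta x_Omega."""
    return alpha * area + beta * x_total


means = {
    (name, alpha, beta): regional_mean(alpha, beta, x_total)
    for name, x_total in X_TOTAL.items()
    for alpha, beta in CASES
}
sqrt_mean_first_case = sqrt(means[("X1", 100, 1.0)])
```

Under the unit-area assumption these values agree, after rounding, with the regional means reported alongside the error tables, and their square roots give the benchmark for the unscaled regression predictor.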
Figure 3: Spatial polygons


Table 5 shows that when $\beta$ is fixed and $\alpha$ increases, resulting in an increase of the mean of the target variables at region level ($\alpha|\Omega| + \beta x_\Omega$), all errors get larger. For fixed coefficients $\alpha$, $\beta$, the errors increase from the first set of sources to the third one, which is natural since the available information decreases from 14 observations for the first case to 4 for the third.

In accordance with the toy example of section 4.4, the errors for $X_2$ are much smaller than the ones for $X_1$ for the areal weighting and the dasymetric methods, due to the difference of homogeneity of the auxiliary variables. The more homogeneous the auxiliary variable is, the more accurate the areal weighting interpolation and dasymetric methods



































































Table 4: Imbalances

Auxiliary variable $X_1$

Case                              Sources      Min     Mean    Max
$\alpha = 100$, $\beta = 1$       14 sources   -0.92   -0.40   0.96
                                  7 sources    -0.90   -0.59   0.64
                                  4 sources    -0.87   -0.44   0.64
$\alpha = 600$, $\beta = 1$       14 sources   -0.62   0.11    0.99
                                  7 sources    -0.53   -0.05   0.93
                                  4 sources    -0.43   0.07    0.93
$\alpha = 1000$, $\beta = 1$      14 sources   -0.43   0.29    1
                                  7 sources    -0.33   0.17    0.96
                                  4 sources    -0.21   0.26    0.96
$\alpha = 1000$, $\beta = 0.1$    14 sources   0.6     0.86    1
                                  7 sources    0.67    0.84    1
                                  4 sources    0.74    0.86    1

Auxiliary variable $X_2$

Case                              Sources      Min     Mean    Max
$\alpha = 100$, $\beta = 1$       14 sources   -0.63   -0.6    -0.58
                                  7 sources    -0.61   -0.6    -0.59
                                  4 sources    -0.6    -0.6    -0.59
$\alpha = 600$, $\beta = 1$       14 sources   0.16    0.21    0.22
                                  7 sources    0.18    0.2     0.22
                                  4 sources    0.19    0.2     0.22
$\alpha = 1000$, $\beta = 1$      14 sources   0.39    0.43    0.45
                                  7 sources    0.41    0.43    0.44
                                  4 sources    0.42    0.43    0.44
$\alpha = 1000$, $\beta = 0.1$    14 sources   0.92    0.92    0.93
                                  7 sources    0.92    0.92    0.93
                                  4 sources    0.92    0.92    0.93

are. As we discussed earlier, if the auxiliary information is almost homogeneous, the regression might be less accurate than the areal weighting. But Table 5 shows that the regression methods are still quite good for $X_2$: in general, they are better than the two classical methods except in some particular cases. The errors of the regression and scaled regression methods are very comparable for $X_1$ and $X_2$. Indeed, the prediction errors of the regression are very similar to the mean of the target variables $Y$ and the accuracy of the scaled regression predictor is equivalent to that of the composite predictor.

The effect of imbalance can be studied by looking at a change of $\alpha$ with fixed $\beta$. A larger $\alpha$ corresponds to a larger influence of the areal effect $\alpha|S|$, which is expected to lead to the domination of the areal weighting interpolation method (indeed we can observe this effect in the table for both auxiliary variables $X_1$ and $X_2$). The imbalance also affects the regression and scaled regression methods: if one of the two effects $\alpha|S|$ and $\beta x_S$ is much larger than the other, the corresponding errors seem to be further from their benchmarks (respectively the mean of $Y$ and the composite prediction error): see for example the cases $\alpha = 1000$, $\beta = 0.1$. However, this effect is not very large for regression and scaled regression. One factor which has more influence on these two methods is the homogeneity of the auxiliary variable: comparing the results for $X_1$ and $X_2$ illustrates this. For $X_1$, the regression prediction errors are almost equal to their respective benchmarks and the amount of initial information does not seem to have a big influence (the errors are not monotonic from the first to the third set of sources). For $X_2$, the accuracy increases with the number of source zones (the best being for the first one) and the errors of the regression method tend to the mean of $Y$. If we consider the particular case $\alpha = 1000$, $\beta = 0.1$ for $X_2$, the areal impact is much stronger than the auxiliary information impact, and we see that the areal weighting interpolation is the best method, and that the scaled regression predictors
Table 5: Square root of prediction errors.

Auxiliary variable $X_1$

Case                              $\sqrt{E(Y_\Omega)}$   Sources      DAW      DAX       REG     ScR     Composite
$\alpha = 100$, $\beta = 1$       354.7                  14 sources   6580.5   1621.6    353.0   333.1   332.1
                                                         7 sources    6962.7   2087.7    352.1   341.3   341.2
                                                         4 sources    7215.7   2090.6    352.1   347.6   347.8
$\alpha = 600$, $\beta = 1$       503.8                  14 sources   6589.6   9532.5    500.6   481.9   482.1
                                                         7 sources    6971.3   12363.7   502.9   493.4   491.7
                                                         4 sources    7225.3   12374.7   502.5   498.8   497.4
$\alpha = 1000$, $\beta = 1$      596.9                  14 sources   6597.3   15878.2   594.0   574.5   574.2
                                                         7 sources    6982.2   20595.2   595.6   586.5   584.6
                                                         4 sources    7229.6   20614.5   594.4   590.8   590.3
$\alpha = 1000$, $\beta = 0.1$    515.8                  14 sources   826.7    15875.6   513.6   500.7   500.9
                                                         7 sources    861.9    20592.8   515.0   509.4   508.3
                                                         4 sources    883.5    20612.4   514.5   512.3   511.6

Auxiliary variable $X_2$

Case                              $\sqrt{E(Y_\Omega)}$   Sources      DAW      DAX       REG     ScR     Composite
$\alpha = 100$, $\beta = 1$       354.4                  14 sources   458.8    353.5     356.6   348.3   344.5
                                                         7 sources    469.0    358.6     357.7   354.2   349.5
                                                         4 sources    474.2    361.3     359.6   358.2   351.6
$\alpha = 600$, $\beta = 1$       503.6                  14 sources   573.9    677.4     504.9   492.9   489.6
                                                         7 sources    587.3    690.0     508.1   503.1   496.6
                                                         4 sources    591.2    697.7     509.1   507.2   499.6
$\alpha = 1000$, $\beta = 1$      596.7                  14 sources   654.8    969.9     599.0   585.0   580.1
                                                         7 sources    665.5    992.6     601.7   595.7   588.4
                                                         4 sources    671.8    1002.6    602.2   600.0   592.0
$\alpha = 1000$, $\beta = 0.1$    515.8                  14 sources   502.9    924.7     518.6   506.8   501.4
                                                         7 sources    510.3    947.4     520.4   515.5   508.6
                                                         4 sources    512.9    960.9     522.2   520.2   511.7
Auxiliary variable $X_1$

Case                              $E(Y_\Omega)$   Sources      DAW      DAX      REG     ScR     Composite
$\alpha = 100$, $\beta = 1$       125847          14 sources   50.459   47.464   6.979   6.872   6.873
                                                  7 sources    55.664   54.021   6.970   6.941   6.951
                                                  4 sources    57.122   54.054   6.978   6.966   6.969
$\alpha = 600$, $\beta = 1$       253847          14 sources   20.458   59.087   3.492   3.422   3.421
                                                  7 sources    21.865   73.067   3.495   3.471   3.466
                                                  4 sources    22.557   73.208   3.490   3.480   3.481
$\alpha = 1000$, $\beta = 1$      356247          14 sources   14.859   62.210   2.813   2.754   2.757
                                                  7 sources    15.827   77.405   2.823   2.801   2.794
                                                  4 sources    16.324   77.589   2.820   2.811   2.808
$\alpha = 1000$, $\beta = 0.1$    266025          14 sources   4.297    72.322   3.093   3.018   3.020
                                                  7 sources    4.433    89.662   3.100   3.069   3.064
                                                  4 sources    4.508    90.028   3.098   3.086   3.082

Auxiliary variable $X_2$

Case                              $E(Y_\Omega)$   Sources      DAW      DAX      REG     ScR     Composite
$\alpha = 100$, $\beta = 1$       125608          14 sources   5.685    4.503    4.545   4.438   4.391
                                                  7 sources    5.799    4.569    4.559   4.514   4.455
                                                  4 sources    5.859    4.603    4.581   4.564   4.482
$\alpha = 600$, $\beta = 1$       253608          14 sources   3.574    4.121    3.186   3.108   3.088
                                                  7 sources    3.655    4.196    3.205   3.174   3.134
                                                  4 sources    3.679    4.236    3.211   3.199   3.153
$\alpha = 1000$, $\beta = 1$      356008          14 sources   2.918    4.084    2.692   2.627   2.606
                                                  7 sources    2.966    4.172    2.704   2.677   2.645
                                                  4 sources    2.993    4.211    2.706   2.696   2.661
$\alpha = 1000$, $\beta = 0.1$    266001          14 sources   3.022    5.133    3.118   3.045   3.015
                                                  7 sources    3.068    5.246    3.129   3.099   3.059
                                                  4 sources    3.084    5.320    3.140   3.128   3.078

Table 6: Relative prediction errors (in percentages).
can catch up with the areal weighting interpolation when there are more source zones.

Table 6 contains the corresponding relative errors. For example, in the case $\alpha = 1000$, $\beta = 1$ for $X_2$, we see that whatever the number of sources, the relative error is around 2.6% to 2.7% for the scaled regression (very close to the benchmark given by the last column), whereas dasymetric is around 4% and areal weighting around 3%. Looking at the second column, we see that when the expected number of points increases, the relative prediction error tends to decrease, which was naturally not the case for the prediction error itself.

We now turn attention to the robustness of the methods with respect to the model. With the same geometrical design as previously, we generate two auxiliary information scenarios: $X_1$ is as in the previous simulation, and $X_3$ is inhomogeneous and uncorrelated with $X_1$ (correlation coefficient of -0.16). A target variable $Y$ is generated from $X_3$ with the relationship $Y_A \sim \mathcal{P}(600|A| + X_3)$. We transfer $Y$ from the first set of 14 sources to the cells (Figure 3) by using areal weighting interpolation, dasymetric interpolation with $X_1$ and $X_3$ as auxiliary variables, and the regression methods (REG and ScR) with the true model (areal effect and $X_3$), a simple model with only the areal effect, an auxiliary variable model with an irrelevant variable (with area and $X_1$), and an auxiliary variable model involving an unnecessary variable (the area and both $X_1$ and $X_3$). Table 7 presents the results.


Methods                              Relative error (%)
DAW                                  7.74
DAX with $X_3$                       9.49
REG with area and $X_3$              2.66
ScR with area and $X_3$              2.62
DAX with $X_1$                       14.70
REG with area and $X_1$              10.48
ScR with area and $X_1$              8.26
REG with area                        10.62
ScR with area                        7.74
REG with area, $X_1$ and $X_3$       2.66
ScR with area, $X_1$ and $X_3$       2.62

Table 7: Robustness of methods.

The most accurate method is the scaled regression with area and $X_3$ (true model). Note that the relative error for DAW and for ScR with area only is the same, which was expected since we proved that in that case the two methods coincide. The regression methods for the model involving area plus $X_1$ and $X_3$ as auxiliary variables have the same errors as for the true model (2.66% and 2.62%): in other words, using unnecessary variables in the regression does not decrease the accuracy. On the other hand, if we use the regression with a wrong choice of auxiliary variable, it gives bad predictions (10.48% and 8.26% for the model with area and $X_1$, 10.62% and 7.74% for the model with only the areal effect). The dasymetric method with $X_3$ is better than with $X_1$ (9.49% vs 14.70%), which makes sense because the correlation of the target variable $Y$ with $X_3$ is 0.998 while with $X_1$ it is -0.159. However, we see that despite the strong correlation between $Y$ and $X_3$, the dasymetric method with $X_3$ is not so good because the areal effect is strong. The scaled regression is always better than the regression method, and the scaled regression in the case of the areal effect model yields the same result as the areal weighting interpolation method.

6 Conclusion 

In this paper we have analyzed the accuracy of four areal interpolation methods: areal weighting interpolation, dasymetric interpolation, Poisson regression and scaled Poisson regression for the case of count data. We have introduced a model based on an underlying Poisson point pattern to be able to evaluate the accuracy of the different methods. We have proposed a scaled version of the Poisson regression method resulting in the enforcement of the pycnophylactic property. Areal weighting interpolation and dasymetric interpolation have been compared with a finite distance approach, and the regression methods have been compared with each other and with the previous ones with an asymptotic approach.

We found that one should not rely only on the correlation between the target variable and the auxiliary variable, or on the homogeneity of the target variable, to decide between areal weighting interpolation and dasymetric; one should also take into account the relative imbalance between the areal effect and the auxiliary effect. A strong areal effect leads to the dominance of areal weighting interpolation and a strong auxiliary effect is in favor of the dasymetric method. Moreover, the imbalance index allows one to approximate the ratio of the two relative errors and their lower bounds as the number of points on the source zones gets large. We establish the formula for the best linear predictor (therefore better than the areal weighting and the dasymetric), which leads to the introduction of the scaled regression method.

For the comparison of areal weighting and dasymetric, a combination of several factors explains the complexity of the behavior: the size of sources, the auxiliary information, the number and size of target zones, etc. The error at source level is smaller when sources are divided into a smaller number of target zones. A large number of points makes the error at source level worse but improves the relative error. These two types of errors have the same behavior as a function of the imbalance index. The impact of the expected number of points and of the inhomogeneity on the comparative advantage of the methods should not be forgotten: indeed, when we have several sources, the sign of the imbalance index may vary from source to source and the overall effect, being an aggregate of the source level effects, will also depend on the magnitude of the source error differences, which is driven by the expected number of points and by the inhomogeneity. We proved that the accuracy of the unfeasible composite predictor is decreasing when the expected number of points is similar on all targets, and this fact extends to scaled regression (due to the approximation results).

To be able to include the regression methods in the comparison, we need to resort to an asymptotic approach. We propose an asymptotic framework and prove that the Poisson regression prediction error is equivalent to the variance of the underlying process and that, for the scaled regression, it is approximated by the prediction error of the composite predictor. These results show that the regression predictor is not automatically better than the areal weighting interpolation or the dasymetric method, but when the number of points at source level is large, it is in general better. Finally, the scaled regression turns out to be the best among the considered methods. These results are confirmed by our simulation study of the last section. The robustness with respect to the model is also considered. The simulations show that a model with extra auxiliary variables does not create any loss, while missing variables or unrelated variables (in place of the correct ones) decrease the accuracy of all methods.


7 Appendix 

7.1 Proofs 

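7.1.1 Proof of Lemma 4.1 and Lemma 4.2 

From (8), (9) and the properties of a Poisson point process we have 

$$E(\hat{Y}_T^{DAW} - Y_T) = E\left(\frac{|T|}{|S|} Y_S - Y_T\right) = \frac{|T|}{|S|}(\alpha|S| + \beta x_S) - (\alpha|T| + \beta x_T) = \beta x_S\left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)$$

$$E(\hat{Y}_T^{DAX} - Y_T) = E\left(\frac{x_T}{x_S} Y_S - Y_T\right) = \frac{x_T}{x_S}(\alpha|S| + \beta x_S) - (\alpha|T| + \beta x_T) = \alpha|S|\left(\frac{x_T}{x_S} - \frac{|T|}{|S|}\right)$$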

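Taking into account the independence of two disjoint target zones and the fact that 
the target T is a portion of the source S, the variances of each method are given as follows 

$$\begin{aligned}
Var(\hat{Y}_T^{DAW} - Y_T) &= Var\left(\frac{|T|}{|S|} Y_S - Y_T\right) \\
&= \frac{|T|^2}{|S|^2} Var(Y_S) + Var(Y_T) - 2\frac{|T|}{|S|} Cov(Y_S, Y_T) \\
&= \frac{|T|^2}{|S|^2} E(Y_S) + E(Y_T) - 2\frac{|T|}{|S|} Cov(Y_{S \setminus T} + Y_T, Y_T) \\
&= \frac{|T|^2}{|S|^2} E(Y_S) + E(Y_T) - 2\frac{|T|}{|S|} Var(Y_T) \\
&= \frac{|T|^2}{|S|^2}(\alpha|S| + \beta x_S) + (\alpha|T| + \beta x_T) - 2\frac{|T|}{|S|}(\alpha|T| + \beta x_T) \\
&= \beta x_S\left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 + \beta x_T\left(1 - \frac{x_T}{x_S}\right) + \alpha|T|\left(1 - \frac{|T|}{|S|}\right)
\end{aligned}$$

$$\begin{aligned}
Var(\hat{Y}_T^{DAX} - Y_T) &= Var\left(\frac{x_T}{x_S} Y_S - Y_T\right) \\
&= \frac{x_T^2}{x_S^2} Var(Y_S) + Var(Y_T) - 2\frac{x_T}{x_S} Cov(Y_S, Y_T) \\
&= \frac{x_T^2}{x_S^2} E(Y_S) + E(Y_T) - 2\frac{x_T}{x_S} Cov(Y_{S \setminus T} + Y_T, Y_T) \\
&= \frac{x_T^2}{x_S^2} E(Y_S) + E(Y_T) - 2\frac{x_T}{x_S} Var(Y_T) \\
&= \frac{x_T^2}{x_S^2}(\alpha|S| + \beta x_S) + (\alpha|T| + \beta x_T) - 2\frac{x_T}{x_S}(\alpha|T| + \beta x_T) \\
&= \alpha|S|\left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 + \beta x_T\left(1 - \frac{x_T}{x_S}\right) + \alpha|T|\left(1 - \frac{|T|}{|S|}\right)
\end{aligned}$$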

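Summing up the variances at target level with the fact that $\sum_T \frac{|T|}{|S|} = \sum_T \frac{x_T}{x_S} = 1$, we 
get the variances at source level 

$$\begin{aligned}
Var_S^{DAW} &= \sum_T Var(\hat{Y}_T^{DAW} - Y_T) \\
&= \sum_T \left[\beta x_S\left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 + \beta x_T\left(1 - \frac{x_T}{x_S}\right) + \alpha|T|\left(1 - \frac{|T|}{|S|}\right)\right] \\
&= \beta x_S \sum_T \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 + \beta x_S\left(1 - \sum_T \frac{x_T^2}{x_S^2}\right) + \alpha|S|\left(1 - \sum_T \frac{|T|^2}{|S|^2}\right)
\end{aligned}$$

$$\begin{aligned}
Var_S^{DAX} &= \sum_T Var(\hat{Y}_T^{DAX} - Y_T) \\
&= \alpha|S| \sum_T \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 + \beta x_S\left(1 - \sum_T \frac{x_T^2}{x_S^2}\right) + \alpha|S|\left(1 - \sum_T \frac{|T|^2}{|S|^2}\right)
\end{aligned}$$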

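7.1.2 Proof of Theorem 4.3 

From Lemma 4.1 and the fact that $\alpha|S| = I_S(|\cdot|)E(Y_S)$, $\beta x_S = I_S(X)E(Y_S)$ we have 

$$\begin{aligned}
Er_T^{DAW} ={}& I_S(X)E(Y_S)\left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 + I_S(X)E(Y_S)\left(\frac{x_T}{x_S} - \frac{x_T^2}{x_S^2}\right) + I_S(|\cdot|)E(Y_S)\left(\frac{|T|}{|S|} - \frac{|T|^2}{|S|^2}\right) \\
&+ I_S(X)^2 E(Y_S)^2 \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2
\end{aligned}$$

$$\begin{aligned}
Er_T^{DAX} ={}& I_S(|\cdot|)E(Y_S)\left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 + I_S(X)E(Y_S)\left(\frac{x_T}{x_S} - \frac{x_T^2}{x_S^2}\right) + I_S(|\cdot|)E(Y_S)\left(\frac{|T|}{|S|} - \frac{|T|^2}{|S|^2}\right) \\
&+ I_S(|\cdot|)^2 E(Y_S)^2 \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2
\end{aligned}$$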


If the expectation of the number of points is sufficiently large, the terms of order 
$E(Y_S)^2$ dominate and we can approximate the ratio of the two errors as follows 

$$\frac{Er_T^{DAW}}{Er_T^{DAX}} \approx \frac{I_S(X)^2}{I_S(|\cdot|)^2}$$

and also 

Re? AW 


Re? AX 

~ Is( l-l) 
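The large-count approximation of the ratio of the two errors can be illustrated numerically. The following is a minimal sketch under assumed values: the helper `errors` and all parameters are hypothetical, and the errors are computed as variance plus squared bias from the Poisson model with intensity $\alpha|\cdot| + \beta x$. Scaling $\alpha$ and $\beta$ by a common factor increases $E(Y_S)$ while leaving $I_S(X)$ and $I_S(|\cdot|)$ unchanged.

```python
# Illustrative only: how Er^DAW / Er^DAX behaves as E(Y_S) grows.
# The helper name and the numeric values are hypothetical.

def errors(alpha, beta, area_s, area_t, x_s, x_t):
    """Mean squared errors (variance + squared bias) of DAW and DAX on one target."""
    d = area_t / area_s - x_t / x_s
    common = (beta * x_t * (1 - x_t / x_s)
              + alpha * area_t * (1 - area_t / area_s))
    er_daw = beta * x_s * d ** 2 + common + (beta * x_s * d) ** 2
    er_dax = alpha * area_s * d ** 2 + common + (alpha * area_s * d) ** 2
    return er_daw, er_dax

alpha, beta = 2.0, 0.5
area_s, area_t, x_s, x_t = 10.0, 4.0, 30.0, 9.0

# I_S(X) / I_S(|.|) = beta * x_S / (alpha * |S|), unchanged by the scaling below
limit = (beta * x_s / (alpha * area_s)) ** 2

ratios = []
for scale in (1, 10, 100, 1000, 10000):
    er_daw, er_dax = errors(scale * alpha, scale * beta, area_s, area_t, x_s, x_t)
    ratios.append(er_daw / er_dax)
    print(scale, er_daw / er_dax, limit)
```

For small expected counts the ratio can be far from the limit, which is why the approximation is stated only for a sufficiently large expected number of points.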


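At source level, we get a similar result by adding up errors on all target zones using the 
fact that $\sum_T \frac{|T|}{|S|} = \sum_T \frac{x_T}{x_S} = 1$ 

$$\begin{aligned}
Er_S^{DAW} ={}& I_S(X)E(Y_S) \sum_T \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 + I_S(X)E(Y_S)\left(1 - \sum_T \frac{x_T^2}{x_S^2}\right) + I_S(|\cdot|)E(Y_S)\left(1 - \sum_T \frac{|T|^2}{|S|^2}\right) \\
&+ I_S(X)^2 E(Y_S)^2 \sum_T \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2
\end{aligned}$$

$$\begin{aligned}
Er_S^{DAX} ={}& I_S(|\cdot|)E(Y_S) \sum_T \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 + I_S(X)E(Y_S)\left(1 - \sum_T \frac{x_T^2}{x_S^2}\right) + I_S(|\cdot|)E(Y_S)\left(1 - \sum_T \frac{|T|^2}{|S|^2}\right) \\
&+ I_S(|\cdot|)^2 E(Y_S)^2 \sum_T \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2
\end{aligned}$$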


Re?-" 1 ' 


■ ife 1 '-™ ?<IH - S' 3 1+ w" 

mi-iki -ES> 2 +« A ')(! - E 4)+wi-ih 1 - E IS)] 


|3T 


Re 


DAX 


E(y s ) 


x s' ' ' V x s . V 


h{ m) 2 E' 


\T I X T\ 

SI x s ' 


Using the relationship $I_S(|\cdot|) + I_S(X) = 1$, the above results prove Theorem 4.3. 

7.1.3 Proof of Theorem 4.4 


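Lemma 4.1 yields 

$$\begin{aligned}
Er_T^{DAW} - Er_T^{DAX} &= Var(\hat{Y}_T^{DAW} - Y_T) + [E(\hat{Y}_T^{DAW} - Y_T)]^2 - Var(\hat{Y}_T^{DAX} - Y_T) - [E(\hat{Y}_T^{DAX} - Y_T)]^2 \\
&= \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 (\beta x_S - \alpha|S|) + \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 (\beta^2 x_S^2 - \alpha^2 |S|^2) \\
&= \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 (\beta x_S - \alpha|S|)(\beta x_S + \alpha|S| + 1) \\
&= \left(\frac{|T|}{|S|} - \frac{x_T}{x_S}\right)^2 \frac{\beta x_S - \alpha|S|}{\beta x_S + \alpha|S|} (\beta x_S + \alpha|S| + 1)(\beta x_S + \alpha|S|)
\end{aligned}$$

7.1.4 Proof of Theorem 4.5 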


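We calculate the error of the composite predictors, then minimize with respect to $w$ to 
find the optimal $w^*$ 

$$\hat{Y}_T^{w} = w\hat{Y}_T^{DAW} + (1 - w)\hat{Y}_T^{DAX} = \left[w\frac{|T|}{|S|} + (1 - w)\frac{x_T}{x_S}\right] Y_S := uY_S$$

$$\begin{aligned}
Bias_T^2 &= [E(\hat{Y}_T^{w} - Y_T)]^2 = (u\lambda_S - \lambda_T)^2 \\
Var_T &= Var(uY_S - Y_T) = u^2\lambda_S + \lambda_T - 2u\lambda_T \\
Er_T &= u^2\lambda_S(\lambda_S + 1) - 2u\lambda_T(\lambda_S + 1) + \lambda_T^2 + \lambda_T
\end{aligned}$$

$$\begin{aligned}
u^* = \arg\min_u Er_T = \frac{\lambda_T}{\lambda_S} 
&\Leftrightarrow w^*\frac{|T|}{|S|} + (1 - w^*)\frac{x_T}{x_S} = \frac{\alpha|T| + \beta x_T}{\alpha|S| + \beta x_S} \\
&\Leftrightarrow w^* = \frac{\alpha|S|}{\alpha|S| + \beta x_S}
\end{aligned}$$

Substituting the $w^*$ in (21) we get the composite predictor (22). 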
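The minimization over $w$ can be checked by brute force. The sketch below is an illustration under assumed parameter values (the function name `composite_error` and all numbers are hypothetical): it evaluates the error $u^2\lambda_S(\lambda_S+1) - 2u\lambda_T(\lambda_S+1) + \lambda_T^2 + \lambda_T$ on a fine grid of weights and compares the grid minimizer with the closed form $\alpha|S|/(\alpha|S| + \beta x_S)$.

```python
# Illustrative grid search for the optimal composite weight.
# Names and numeric values are hypothetical.

def composite_error(w, alpha, beta, area_s, area_t, x_s, x_t):
    """Error of w*DAW + (1 - w)*DAX on one target under the Poisson model."""
    lam_s = alpha * area_s + beta * x_s
    lam_t = alpha * area_t + beta * x_t
    u = w * area_t / area_s + (1 - w) * x_t / x_s
    return (u * u * lam_s * (lam_s + 1)
            - 2 * u * lam_t * (lam_s + 1)
            + lam_t ** 2 + lam_t)

params = (2.0, 0.5, 10.0, 4.0, 30.0, 9.0)
alpha, beta, area_s, area_t, x_s, x_t = params
lam_s = alpha * area_s + beta * x_s
lam_t = alpha * area_t + beta * x_t

w_star = alpha * area_s / (alpha * area_s + beta * x_s)  # closed form
grid = [k / 10000 for k in range(10001)]                 # weights in [0, 1]
w_grid = min(grid, key=lambda w: composite_error(w, *params))

min_error = composite_error(w_star, *params)
print(w_star, w_grid, min_error, lam_t - lam_t ** 2 / lam_s)
```

The error at the optimal weight coincides with $\lambda_T - \lambda_T^2/\lambda_S$, the variance of $\frac{\lambda_T}{\lambda_S}Y_S - Y_T$.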

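The bias, variance and error of the above composite predictor $\hat{Y}_T^{C} = \frac{\lambda_T}{\lambda_S} Y_S$ are calculated as follows 

$$Bias_T^{C} = E(\hat{Y}_T^{C} - Y_T) = 0$$

$$\begin{aligned}
Er_T^{C} = Var(\hat{Y}_T^{C} - Y_T) &= Var\left(\frac{\lambda_T}{\lambda_S} Y_S - Y_T\right) \\
&= \frac{\lambda_T^2}{\lambda_S^2} Var(Y_S) + Var(Y_T) - 2\frac{\lambda_T}{\lambda_S} Cov(Y_S, Y_T) \\
&= \frac{\lambda_T^2}{\lambda_S^2} \lambda_S + \lambda_T - 2\frac{\lambda_T}{\lambda_S} \lambda_T = \lambda_T - \frac{\lambda_T^2}{\lambda_S} \\
&= \frac{x_T^2}{x_S^2} \lambda_S + \lambda_T - 2\frac{x_T}{x_S} \lambda_T - \frac{x_T^2}{x_S^2} \lambda_S + 2\frac{x_T}{x_S} \lambda_T - \frac{\lambda_T^2}{\lambda_S} \\
&= \frac{x_T^2}{x_S^2} \lambda_S + \lambda_T - 2\frac{x_T}{x_S} \lambda_T - \lambda_S\left(\frac{x_T}{x_S} - \frac{\lambda_T}{\lambda_S}\right)^2 \\
&= Var(\hat{Y}_T^{DAX} - Y_T) - \lambda_S\left(\frac{x_T}{x_S} - \frac{\lambda_T}{\lambda_S}\right)^2 \\
&= Var(\hat{Y}_T^{DAW} - Y_T) - \lambda_S\left(\frac{|T|}{|S|} - \frac{\lambda_T}{\lambda_S}\right)^2
\end{aligned}$$

Since 

$$Y_T \mid Y_S \sim Bi\left(Y_S, \frac{E(Y_T)}{E(Y_S)}\right)$$

we have 

$$E(Y_T \mid Y_S) = \frac{\lambda_T}{\lambda_S} Y_S = \hat{Y}_T^{C}$$

This shows that the composite predictor is the best linear predictor. 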
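The conditional binomial law used in this last step follows from writing $Y_S$ as the sum of $Y_T$ and an independent Poisson count on the rest of the source. As a hedged illustration (the function names and the numeric values are assumptions chosen for the example), the sketch below computes $P(Y_T = k \mid Y_S = n)$ directly from the Poisson probabilities and compares it with the binomial probability with success parameter $\lambda_T/\lambda_S$.

```python
from math import comb, exp, factorial

# Illustrative check that Y_T | Y_S = n is Binomial(n, lam_t / lam_s)
# when Y_S = Y_T + Y_rest with independent Poisson terms.
# Names and numeric values are hypothetical.

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def conditional_pmf(k, n, lam_t, lam_rest):
    """P(Y_T = k | Y_S = n) computed from the joint Poisson probabilities."""
    joint = poisson_pmf(k, lam_t) * poisson_pmf(n - k, lam_rest)
    return joint / poisson_pmf(n, lam_t + lam_rest)

def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

lam_t, lam_rest, n = 12.5, 22.5, 30
p = lam_t / (lam_t + lam_rest)

max_gap = max(
    abs(conditional_pmf(k, n, lam_t, lam_rest) - binomial_pmf(k, n, p))
    for k in range(n + 1)
)
cond_mean = sum(k * conditional_pmf(k, n, lam_t, lam_rest) for k in range(n + 1))
print(max_gap, cond_mean, n * p)
```

The conditional mean equals $n\lambda_T/\lambda_S$, which is the value taken by the composite predictor when $Y_S = n$.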


7.1.5 Proof of Theorem 5.1 

To prove the theorem, we will prove the following lemmas. 

Lemma 7.1. Under conditions (C1) and (C2), the normed score function $F_n^{-1/2} s_n$ is 
asymptotically normal 

$$F_n^{-1/2} s_n \to_d \mathcal{N}(0, I) \qquad (34)$$

Lemma 7.2. Under conditions (C1) and (C2), for all $\delta > 0$ 

$$\max_{\gamma \in N_n(\delta)} \|V_n(\gamma) - I\| \to_p 0 \qquad (35)$$

where $N_n(\delta) = \{\gamma : \|F_n^{T/2}(\gamma - \gamma_0)\| < \delta\}$, $V_n(\gamma) = F_n^{-1/2} H_n(\gamma) F_n^{-T/2}$. 


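Lemma 7.1 is proved by using the Lindeberg-Feller theorem. 
Indeed, for $\tau$ fixed with $\tau'\tau = 1$, considering the triangular array 

$$z_{n,i} = \tau' F_n^{-1/2} \frac{Z_{n,i}}{\gamma_0' Z_{n,i}} (y_{n,i} - \gamma_0' Z_{n,i}) \qquad (36)$$

we have 

$$E(z_{n,i}) = 0, \qquad \sum_i Var(z_{n,i}) = 1$$

We will show that the Lindeberg condition is satisfied, i.e. for any $\varepsilon > 0$ 

$$\sum_i E\left(z_{n,i}^2 \mathbf{1}_{\{|z_{n,i}| > \varepsilon\}}\right) \to 0 \qquad (37)$$

as $n \to \infty$. 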

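Let $a_{n,i} = \tau' F_n^{-1/2} \frac{Z_{n,i}}{\gamma_0' Z_{n,i}}$. Because $z_{n,i}^2 = a_{n,i}^2 (y_{n,i} - \gamma_0' Z_{n,i})^2$, we have 

$$E\left(z_{n,i}^2 \mathbf{1}_{\{|z_{n,i}| > \varepsilon\}}\right) = a_{n,i}^2 E\left((y_{n,i} - \gamma_0' Z_{n,i})^2 \mathbf{1}_{\{|y_{n,i} - \gamma_0' Z_{n,i}| > \varepsilon / |a_{n,i}|\}}\right)$$

which yields 

$$\sum_i E\left(z_{n,i}^2 \mathbf{1}_{\{|z_{n,i}| > \varepsilon\}}\right) \le \left(\sum_i a_{n,i}^2\right) \sup_i E\left((y_{n,i} - \gamma_0' Z_{n,i})^2 \mathbf{1}_{\{|y_{n,i} - \gamma_0' Z_{n,i}| > \varepsilon / |a_{n,i}|\}}\right)$$

Moreover, condition (C1) yields that there is a positive number $K_1$ s.t. $\frac{1}{\gamma_0' Z_{n,i}} < K_1$, $\forall (n,i)$, hence 

$$\sum_i a_{n,i}^2 = \tau' F_n^{-1/2} \left(\sum_i \frac{Z_{n,i} Z_{n,i}'}{(\gamma_0' Z_{n,i})^2}\right) F_n^{-1/2} \tau \le K_1$$

In addition, conditions (C1), (C2) lead to 

$$\min_i \frac{\varepsilon}{|a_{n,i}|} \to \infty$$

as $n \to \infty$, hence for any $M > 0$, $\exists n_1$ s.t. $\forall n > n_1$ 

$$\begin{aligned}
\sup_i E\left((y_{n,i} - \gamma_0' Z_{n,i})^2 \mathbf{1}_{\{|y_{n,i} - \gamma_0' Z_{n,i}| > \varepsilon / |a_{n,i}|\}}\right) 
&\le \sup_i E\left((y_{n,i} - \gamma_0' Z_{n,i})^2 \mathbf{1}_{\{|y_{n,i} - \gamma_0' Z_{n,i}| > M\}}\right) \\
&\le \sup_i \sqrt{E(y_{n,i} - \gamma_0' Z_{n,i})^4 \, E\left(\mathbf{1}_{\{|y_{n,i} - \gamma_0' Z_{n,i}| > M\}}\right)} \\
&\le \sup_i \sqrt{E(y_{n,i} - \gamma_0' Z_{n,i})^4 \, \frac{Var(y_{n,i} - \gamma_0' Z_{n,i})}{M^2}} \\
&= \sup_i \sqrt{\frac{\gamma_0' Z_{n,i}(1 + 3\gamma_0' Z_{n,i}) \, \gamma_0' Z_{n,i}}{M^2}} \le \frac{K_2}{M}
\end{aligned}$$

Since $M$ is arbitrary, 

$$\sup_i E\left((y_{n,i} - \gamma_0' Z_{n,i})^2 \mathbf{1}_{\{|y_{n,i} - \gamma_0' Z_{n,i}| > \varepsilon / |a_{n,i}|\}}\right) \to 0 \quad \text{as } n \to \infty$$

where the existence of $K_2$ is derived from condition (C1). This argument shows that 
(37) holds, and so does Lemma 7.1. 

Proof of Lemma 7.2 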

Using the same notation as in the proof of Lemma 7.1, with $\tau$ fixed s.t. $\tau'\tau = 1$, let $b_{n,i} = \tau' F_n^{-1/2} Z_{n,i}$. The quantity in (35) can be rewritten as 

$$\tau'(V_n(\gamma) - I)\tau = A_n + B_n + C_n \qquad (38)$$

where 

$$A_n = \sum_i b_{n,i}^2 \left(\frac{\gamma_0' Z_{n,i}}{(\gamma' Z_{n,i})^2} - \frac{1}{\gamma_0' Z_{n,i}}\right) \qquad (39)$$

$$B_n = \sum_i b_{n,i}^2 \frac{1}{(\gamma_0' Z_{n,i})^2} (y_{n,i} - \gamma_0' Z_{n,i}) \qquad (40)$$

$$C_n = \sum_i b_{n,i}^2 \left(\frac{1}{(\gamma' Z_{n,i})^2} - \frac{1}{(\gamma_0' Z_{n,i})^2}\right)(y_{n,i} - \gamma_0' Z_{n,i}) \qquad (41)$$

We will prove that the three terms converge in probability to 0 as $n$ tends to $\infty$. To 
prove this for (40), we first study its moments. We have 

$$E(B_n) = 0$$

$$\begin{aligned}
Var(B_n) &= \sum_i b_{n,i}^4 \frac{1}{(\gamma_0' Z_{n,i})^4} Var(y_{n,i} - \gamma_0' Z_{n,i}) = \sum_i b_{n,i}^4 \frac{1}{(\gamma_0' Z_{n,i})^3} \\
&\le \sup_i \left(\frac{b_{n,i}^2}{(\gamma_0' Z_{n,i})^2}\right) \sum_i \frac{b_{n,i}^2}{\gamma_0' Z_{n,i}} \le K_3 \sup_i b_{n,i}^2
\end{aligned}$$

Because of the boundedness of $1/(\gamma_0' Z_{n,i})^3$ and the definition of $N_n(\delta)$, $\gamma' Z_{n,i}$ is bounded 
when $n$ is large enough; moreover, $\sup_i b_{n,i}^2 \to 0$ due to conditions (C1), (C2), therefore 

$$B_n \to_p 0$$

We can use a similar argument to prove $A_n \to_p 0$, $C_n \to_p 0$, and this shows that 
Lemma 7.2 holds. 


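7.1.6 Proof of Theorem 5.3 

Let 

$$z_{nij} \sim \mathcal{P}\left(\frac{\gamma_0' Z_{n,i}}{k_n}\right) - \frac{\gamma_0' Z_{n,i}}{k_n}, \quad j = 1, 2, \ldots, k_n \text{ i.i.d.}$$

This yields 

$$\sum_j z_{nij} = Y_{n,i} - \gamma_0' Z_{n,i}$$

We have 

$$E(z_{nij}) = 0, \qquad \sum_j Var(z_{nij}) = \gamma_0' Z_{n,i}$$

We will prove that this array satisfies the Lindeberg-Feller condition, i.e. $\forall \delta > 0$ 

$$\sum_j E\left(z_{nij}^2 \mathbf{1}_{\{|z_{nij}| > \delta\}}\right) \to 0, \quad \text{as } n \to \infty$$

Indeed, 

$$\sum_j E\left(z_{nij}^2 \mathbf{1}_{\{|z_{nij}| > \delta\}}\right) = k_n E\left(z_{ni1}^2 \mathbf{1}_{\{|z_{ni1}| > \delta\}}\right) = E\left(u_{n,i}^2 \mathbf{1}_{\{|u_{n,i}| > \sqrt{k_n}\delta\}}\right)$$

where $u_{n,i} = \sqrt{k_n} z_{ni1}$. Because $E u_{n,i} = 0$, $E u_{n,i}^2 = Var\, u_{n,i} = k_n Var\, z_{ni1} = \gamma_0' Z_{n,i} < \infty$, 
and moreover $k_n \to \infty$ as $n \to \infty$, we have 

$$E\left(u_{n,i}^2 \mathbf{1}_{\{|u_{n,i}| > \sqrt{k_n}\delta\}}\right) \to 0 \quad \text{as } n \to \infty$$

From the Lindeberg-Feller theorem we get 

$$\frac{Y_{n,i} - \gamma_0' Z_{n,i}}{\sqrt{\gamma_0' Z_{n,i}}} \to_d \mathcal{N}(0, 1)$$

This proof can be applied at the target level, i.e. 

$$\frac{Y_T - \gamma_0' Z_T}{\sqrt{\gamma_0' Z_T}} \to_d \mathcal{N}(0, 1)$$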
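The normal approximation of a standardized Poisson count can be illustrated with an exact computation. The sketch below is an assumption-laden illustration, not part of the proof: the function name and the chosen means are hypothetical. It evaluates $P(|Y - \lambda| \le 1.96\sqrt{\lambda})$ for $Y \sim \mathcal{P}(\lambda)$ by summing the Poisson probabilities on the log scale, and this probability should approach the standard normal value of about 0.95 as $\lambda$ grows.

```python
from math import ceil, exp, floor, lgamma, log, sqrt

# Illustrative only: exact coverage of the interval lam +/- z*sqrt(lam)
# for a Poisson count. Name and values are hypothetical.

def poisson_interval_prob(lam, z=1.96):
    """P(|Y - lam| <= z*sqrt(lam)) for Y ~ Poisson(lam), summed exactly."""
    lo = max(0, ceil(lam - z * sqrt(lam)))
    hi = floor(lam + z * sqrt(lam))
    # log pmf = k*log(lam) - lam - log(k!), evaluated with lgamma for stability
    return sum(exp(k * log(lam) - lam - lgamma(k + 1)) for k in range(lo, hi + 1))

probs = {lam: poisson_interval_prob(lam) for lam in (5, 50, 500, 5000)}
for lam, p in probs.items():
    print(lam, p)
```

The discreteness of the count makes the coverage fluctuate for small means, so only the behavior for large means is informative here.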


7.1.7 Proof of Proposition 5.2 

The pycnophylactic property of the scaled regression predictor is obvious. 

To prove the pycnophylactic property of the regression predictor at region level, we 
sum up regression predictors over source zones 

$$\hat{Y}_n^{REG} = \sum_i \sum_{T: T \subset S_{n,i}} \hat{Y}_T^{REG} = \sum_i \sum_{T: T \subset S_{n,i}} \hat{\gamma}_n' Z_T = \hat{\gamma}_n' Z_n$$

Recall that $\hat{\gamma}_n$ is the solution of the score equation $s_n(\gamma) = 0$, i.e. 

$$\begin{aligned}
& \sum_{i=1}^{n} Z_{n,i} \frac{y_{n,i} - \hat{\gamma}_n' Z_{n,i}}{\hat{\gamma}_n' Z_{n,i}} = 0 \\
\Rightarrow{} & \sum_{i=1}^{n} \hat{\gamma}_n' Z_{n,i} \frac{y_{n,i} - \hat{\gamma}_n' Z_{n,i}}{\hat{\gamma}_n' Z_{n,i}} = 0 \\
\Leftrightarrow{} & \sum_{i=1}^{n} (y_{n,i} - \hat{\gamma}_n' Z_{n,i}) = 0 \\
\Leftrightarrow{} & \hat{\gamma}_n' Z_n = y_n
\end{aligned}$$

In other words, the regression predictor satisfies the pycnophylactic property at region 
level. 
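The region-level property can be observed on a toy data set. The following is a minimal sketch, assuming a Poisson regression with identity link and two covariates (area and one auxiliary variable) fitted by Fisher scoring; the function name, the starting values and the data are hypothetical. Each Fisher scoring step solves $F\gamma_{new} = \sum_i Z_i y_i / \mu_i$, and premultiplying by the previous coefficient vector shows that the fitted values sum to $\sum_i y_i$ after every step, not only at convergence.

```python
# Illustrative only: identity-link Poisson regression with two covariates,
# fitted by Fisher scoring. Data, names and starting values are hypothetical.

def fit_identity_poisson(areas, xs, ys, n_iter=50):
    """Fit E(y_i) = a*area_i + b*x_i and return the coefficients (a, b)."""
    a, b = 1.0, 1.0  # starting values, assumed to give positive means
    for _ in range(n_iter):
        f11 = f12 = f22 = r1 = r2 = 0.0
        for s, x, y in zip(areas, xs, ys):
            mu = a * s + b * x          # current fitted mean
            f11 += s * s / mu           # Fisher information entries
            f12 += s * x / mu
            f22 += x * x / mu
            r1 += s * y / mu            # right-hand side: sum of Z_i * y_i / mu_i
            r2 += x * y / mu
        det = f11 * f22 - f12 * f12
        # Solve the 2x2 system F * (a, b)' = (r1, r2)' by Cramer's rule
        a, b = (f22 * r1 - f12 * r2) / det, (f11 * r2 - f12 * r1) / det
    return a, b

areas = [1.0, 2.0, 3.0, 4.0]
xs = [2.0, 1.0, 4.0, 3.0]
ys = [5, 6, 10, 12]

a_hat, b_hat = fit_identity_poisson(areas, xs, ys)
fitted = [a_hat * s + b_hat * x for s, x in zip(areas, xs)]
print(sum(fitted), sum(ys))
```

The individual fitted values generally differ from the observed counts, which is consistent with the property holding at region level but not at source level.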

To study the pycnophylactic property of the regression predictor at source level, we 
consider 

$$\frac{1}{\sqrt{\gamma_0' Z_{n,i}}} (\hat{Y}_{n,i}^{REG} - Y_{n,i})$$

We have 

$$\frac{\hat{Y}_{n,i}^{REG} - Y_{n,i}}{\sqrt{\gamma_0' Z_{n,i}}} = \frac{(\hat{\gamma}_n - \gamma_0)' Z_{n,i}}{\sqrt{\gamma_0' Z_{n,i}}} - \frac{Y_{n,i} - \gamma_0' Z_{n,i}}{\sqrt{\gamma_0' Z_{n,i}}}$$

The first term converges to 0 in distribution due to conditions (C1), (C2) and 
Theorem 5.1. The second term is different from 0, even asymptotically (Proposition 5.2). 
Moreover, because of the boundedness of $Z_{n,i}$, the above argument yields 

$$\frac{\hat{Y}_{n,i}^{REG} - Y_{n,i}}{\sqrt{\gamma_0' Z_{n,i}}} \to_d \mathcal{N}(0, 1)$$

This completes the proof of Proposition 5.2. 

If $Z_T$ is bounded below, a similar result at target level holds 

$$\frac{\hat{Y}_T^{REG} - Y_T}{\sqrt{\gamma_0' Z_T}} \to_d \mathcal{N}(0, 1)$$

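7.1.8 Proof of Theorem 5.4 

For any target $T$, the error of the regression predictor on the target is 

$$E(\hat{Y}_T^{REG} - Y_T)^2 = E(\hat{\gamma}_n' Z_T - \gamma_0' Z_T)^2 + E(\gamma_0' Z_T - Y_T)^2 + 2E(\hat{\gamma}_n' Z_T - \gamma_0' Z_T)(\gamma_0' Z_T - Y_T)$$

From Theorem 5.1 and condition (C1), for any $\eta_1 > 0$, $\exists \varepsilon > 0$ s.t. when $n$ is sufficiently 
large 

$$E(\hat{\gamma}_n' Z_T - \gamma_0' Z_T)^2 \mathbf{1}_{\{\|\hat{\gamma}_n - \gamma_0\| < \varepsilon\}} < \eta_1 \qquad (42)$$

$$\left|2E(\hat{\gamma}_n' Z_T - \gamma_0' Z_T)(\gamma_0' Z_T - Y_T) \mathbf{1}_{\{\|\hat{\gamma}_n - \gamma_0\| < \varepsilon\}}\right| < \eta_1 \qquad (43)$$

As we proved in Proposition 5.2, $(\hat{\gamma}_n' Z_T - \gamma_0' Z_T) \to_p 0$, so we have 

$$P(\|\hat{\gamma}_n - \gamma_0\| < \varepsilon) \to 1 \quad \text{as } n \to \infty$$

In addition 

$$E(\gamma_0' Z_T - Y_T)^2 = \gamma_0' Z_T$$

Hence there is $n_1$ s.t. 

$$E(\gamma_0' Z_T - Y_T)^2 \mathbf{1}_{\{\|\hat{\gamma}_n - \gamma_0\| > \varepsilon\}} \le \sqrt{E(\gamma_0' Z_T - Y_T)^4 \, P(\|\hat{\gamma}_n - \gamma_0\| > \varepsilon)} < \eta_1$$

for $n > n_1$. In other words, 

$$\gamma_0' Z_T - \eta_1 < E(\gamma_0' Z_T - Y_T)^2 \mathbf{1}_{\{\|\hat{\gamma}_n - \gamma_0\| < \varepsilon\}} < \gamma_0' Z_T \qquad (44)$$

Combining (42), (43), (44), this implies $\forall \eta > 0$, $\exists \varepsilon > 0$, $n_1$ s.t. for $n > n_1$ 

$$-\eta + \gamma_0' Z_T < E(\hat{Y}_T^{REG} - Y_T)^2 \mathbf{1}_{\{\|\hat{\gamma}_n - \gamma_0\| < \varepsilon\}} < \eta + \gamma_0' Z_T$$

which, with the remark that $P(\|\hat{\gamma}_n - \gamma_0\| < \varepsilon) \to 1$ as $n \to \infty$, gives Theorem 5.4. 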

7.1.9 Proof of equations (30) 

We rewrite the errors of the areal weighting interpolation and dasymetric predictors for 
the asymptotic model. For a target $T \subset S_{n,i}$, from (23), (24), and Lemma 4.1 we have 


daw _ q 2 ~ _./ v \ 1 'o Z t) 


Er^ M = / 3 2 x n , n ' 0 Z T 


+ 7 o Z n,i( 


£L - 7IW+ P 2 xi(YX - A f 

|S„,i| AZn,i |S„,i| X n ,i 


Er 


DAX 


7 o^t ~ 


Yo Z n , 

(7 o Z t) 2 , lo^T ^2 , „2| C ft 

— - +7o4m(~ -7^—) +« |*n,i| L 

7 o Z n,i x n,i Yo Z n,i j S', 


XT l'o Z T n 2 , „ 2iTT | 2 / 1^1 


X T y 
%n,i 

A similar argument as in the proof of Theorem 5.4 shows that, for any $\eta_1 > 0$, $\varepsilon > 0$, $\exists n_1$ 
s.t. $\forall n > n_1$ 


p DAW ^ ]R U~yDAW -y \2-i , rp DAW 

hji T ~ Vi < — Yt) l|| 7n _ 7o ||< £ < YiT t 

"' An4v <Ei eax 


Ei° ax - m < 


. E(Y° ax - U) 2 l||,„_ 7 „|,< e 
theorem 5.4, let Qj = {|| 7 n — 7 „|| < e}, we have 

_ D ZT’/"' _s Q _ 7~> /I T XT _. O t 


With £ chosen as in 

7 'Jr - Er° AW -r,< E (Y EEG - Y T ) 2 l Qi - E (Y EAW - Y T ) 2 l Qi < j 0 Z T - Yx EAW + r, + r lx 
i 0 Z T - Ex eax - V < E{Y eeg - Y t ) 2 l Qi - E (Y EAX - Y T ) 2 l Qi < i 0 Z T - Ei eax + rj + r h 


7 o Z T 


Er Em = /3x n!i ^f- + a\S nji y _ 2 

r ' \S nti \ 


for all n> n\. Moreover, 

17,2 ■ - Y? - PxuYY - Y? 

15 -I %n,i \S -I 

7 oZt ~ Er? AW = Px nti ?f + a\sZ\^- 2 - a\sZ\(J^L - ^ ) 2 - a 2 \sZ\\ML Xt ' 2 

| S Uii | |^| *»•* |5 Bii 


n,i 


%n,i 

Taking the sum over all target zones which belong to $S_{n,i}$, then scaling the sum by $E(Y_{n,i})$ 
and calculating the differences in terms of $\lambda_{n,i} = \lambda_{S_{n,i}}$, we have 

metric asymptotically. If this difference increases, the difference between the regression 
and the other two methods gets smaller and then the regression method can do better 
than the other two methods. Indeed, for example when jZ- = ^ ^’ 2 / 3 15*^ 
, this yields 

Choosing $\eta$, $\eta_1$ to be sufficiently small, the regression predictor is asymptotically better than 
the areal weighting interpolation predictor. A similar result for the case of the dasymetric 
predictor can be proved similarly. 

We have therefore proved that none of the three considered methods is always dominant. 

7.1.10 Proof of Lemma 5.5 

Assume $T \subset S_{n,i}$; the difference between the predictors of scaled regression and 
composite predictor is given by 

where $\eta$ belongs to the segment between $\hat{\gamma}_n$ and $\gamma_0$. 

From Theorem 5.1, property (28), conditions (C1), (C2), we have 

7.1.11 Proof of Theorem 5.6 

Because of the boundedness of $Z_{n,i}$ and the upper boundedness of $Z_T$, there exists 

where $B(\gamma_0, 1) = \{\gamma : \|\gamma_0 - \gamma\| < 1\}$. Since $\hat{\gamma}_n - \gamma_0 \to_p 0$, the sequence $\hat{\gamma}_n$, $n = 1, 2, \ldots$ is 
bounded, therefore for any $\varepsilon > 0$, when $n$ is large enough 

Evaluating the error on the set $\{\|\hat{\gamma}_n - \gamma_0\| < \varepsilon\}$, we have 

Moreover 

With the same argument as in Theorem 5.4, when $n$ is large enough 

Using a similar argument as above, we can prove $\forall \eta > 0$, $\exists \varepsilon > 0$ and $n$ large enough 
such that 

In other words 

Note that $P(\|\hat{\gamma}_n - \gamma_0\| < \varepsilon) \to 1$ as $n \to \infty$ and the theorem holds. 